diff --git a/README.md b/README.md
index 8e91865fa..9c69caecf 100644
--- a/README.md
+++ b/README.md
@@ -19,9 +19,11 @@ This repository contains release versions of the Georgetown University Multilaye
The corpus is created as part of the course LING-367 (Computational Corpus Linguistics) at Georgetown University. For more details see: https://gucorpling.org/gum.
-## A note about reddit data
+## A note about Reddit data
-For one of the twelve text types in this corpus, reddit forum discussions, plain text data is not supplied. To obtain this data, please run `_build/process_reddit.py`, then `run _build/build_gum.py`. This, and all data, is provided with absolutely no warranty; users agree to use the data under the license with which it is provided, and reddit data is subject to reddit's terms and conditions. See [README_reddit.md](README_reddit.md) for more details.
+For one of the twelve text types in this corpus, Reddit forum discussions, plain text data is not supplied, and you will find **underscores** in place of word forms in documents from this data (files named `GUM_reddit_*`). To obtain this data, please run `python get_text.py`, which will allow you to reconstruct the text in these files. This, and all data, is provided with absolutely no warranty; users agree to use the data under the license with which it is provided, and Reddit data is subject to Reddit's terms and conditions. See [README_reddit.md](README_reddit.md) for more details.
+
+Note that the `get_text.py` script only regenerates the files named `GUM_reddit_*` in each folder, and will not create full versions of the data in `PAULA/` and `annis/`. If you require PAULA XML or searchable ANNIS data containing these documents, you will need to recompile the corpus from the source files under `_build/src/`. To do this, run `_build/process_reddit.py`, then run `_build/build_gum.py`.
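+
+A typical invocation looks like this (a sketch; see [README_reddit.md](README_reddit.md) for the prerequisites, such as API credentials or confirming the proxy download):
+
+```
+# Restore Reddit text in the top-level folders (all formats except PAULA and ANNIS)
+python get_text.py
+
+# Rebuild the corpus from _build/src/, including PAULA and ANNIS
+python _build/process_reddit.py
+python _build/build_gum.py
+```
+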
## Train / dev / test splits
@@ -78,16 +80,21 @@ For a full list of contributors please see [the corpus website](https://gucorpli
The corpus is downloadable in multiple formats. Not all formats contain all annotations: The most accessible format is probably CoNLL-U dependencies (in `dep/`), but the most complete XML representation is in [PAULA XML](https://www.sfb632.uni-potsdam.de/en/paula.html), and the easiest way to search in the corpus is using [ANNIS](http://corpus-tools.org/annis). Here is [an example query](https://gucorpling.org/annis/#_q=ZW50aXR5IC0-YnJpZGdlIGVudGl0eSAmICMxIC0-aGVhZCBsZW1tYT0ib25lIg&_c=R1VN&cl=5&cr=5&s=0&l=10) for phrases headed by 'one' bridging back to a different, previously mentioned entity. Other formats may be useful for other purposes. See website for more details.
-**NB: reddit data is not included in top folders - consult README_reddit.md to add it**
+**NB: Reddit data in the top-level folders does not include the underlying text forms - consult [README_reddit.md](README_reddit.md) to add them**
* _build/ - The [GUM build bot](https://gucorpling.org/gum/build.html) and utilities for data merging and validation
* annis/ - The entire merged corpus, with all annotations, as a relANNIS 3.3 corpus dump, importable into [ANNIS](http://corpus-tools.org/annis)
* const/ - Constituent trees with function labels and PTB POS tags in the PTB bracketing format (automatic parser output from gold POS with functions projected from gold dependencies)
* coref/ - Entity and coreference annotation in two formats:
* conll/ - CoNLL shared task tabular format (with Wikification but no bridging or split antecedent annotations)
+ * tsv/ - WebAnno .tsv format, including entity type, salience and information status annotations, Wikification, bridging, split antecedent and singleton entities
* ontogum/ - alternative version of coreference annotation in CoNLL, tsv and CoNLL-U formats following OntoNotes guidelines (see Zhu et al. 2021)
- * tsv/ - WebAnno .tsv format, including entity and information status annotations, Wikification, bridging, split antecedent and singleton entities
- * dep/ - Dependency trees using Universal Dependencies, enriched with sentence types, enhanced dependencies, entities, information status, coreference, bridging, Wikification, XML markup, morphological tags and Universal POS tags according to the UD standard
+ * dep/ - Dependency trees using Universal Dependencies, enriched with metadata, summaries, sentence types, speaker information, enhanced dependencies, entities, information status, salience, centering, coreference, bridging, Wikification, XML markup, morphological tags and Universal POS tags according to the UD standard
* paula/ - The entire merged corpus in standoff [PAULA XML](https://github.com/korpling/paula-xml), with all annotations
- * rst/ - Rhetorical Structure Theory analyses in .rs3 format as used by RSTTool and rstWeb, as well as binary and n-ary lisp trees (.dis) and an RST dependency representation (.rsd)
- * xml/ - vertical XML representations with 1 token or tag per line and tab delimited lemmas and POS tags (extended VVZ style, vanilla, UPOS and CLAWS5, as well as dependency functions), compatible with the IMS Corpus Workbench (a.k.a. TreeTagger format).
+ * rst/ - Rhetorical Structure Theory analyses
+ * rstweb/ - full .rs3 format data as used by RSTTool and rstWeb (recommended)
+ * lisp_nary/ - n-ary lisp trees (.dis format)
+ * lisp_binary/ - binarized lisp trees (.dis format)
+ * dependencies/ - a converted RST dependency representation (.rsd format)
+ * disrpt/ - plain segmentation and relation-per-line data formats following the DISRPT shared task specification
+ * xml/ - vertical XML representations with 1 token or tag per line, metadata, summaries and tab delimited lemmas and POS tags (extended VVZ style, vanilla, UPOS and CLAWS5, as well as dependency functions), compatible with the IMS Corpus Workbench (a.k.a. TreeTagger format).
diff --git a/README_reddit.md b/README_reddit.md
index ae9bb4bf0..90c8b1a46 100644
--- a/README_reddit.md
+++ b/README_reddit.md
@@ -1,25 +1,24 @@
-# Data from reddit
+# Data from Reddit
-For one of the text types in this corpus, reddit forum discussions, plain text data is not supplied in this repository. To obtain this data, please follow the instructions below.
+For one of the text types in this corpus, Reddit forum discussions, plain text data is not supplied in this repository. To obtain this data, please follow the instructions below.
## Annotations
-Documents in the reddit subcorpus are named GUM_reddit_* (e.g. GUM_reddit_superman) and are *not* included in the root folder with all annotation layers. The annotations for the reddit subcorpus can be found together with all other document annotations in `_build/src/`. Token representations in these files are replaced with underscores, while the annotations themselves are included in the files. To compile the corpus including reddit data, you must obtain the underlying texts.
+Documents in the Reddit subcorpus are named `GUM_reddit_*` (e.g. `GUM_reddit_superman`) and are included in the root folders with all annotation layers, but with underscores in place of the underlying text. To compile the corpus including Reddit data, you must obtain the underlying texts and either regenerate the files in the top-level folders (works for all formats except `PAULA` and `annis`) or rebuild the corpus (see below).
-## Obtaining underlying reddit text data
+## Obtaining underlying Reddit text data
-To recover reddit data, use the API provided by the script `_build/process_reddit.py`. If you have your own credentials for use with the Python reddit API wrapper (praw) and Google bigquery, you should include them in two files, `praw.txt` and `key.json` in `_build/utils/get_reddit/`. For this to work, you must have the praw and bigquery libraries installed for python (e.g. via pip). You can then run `python _build/process_reddit.py` to recover the data, and proceed to the next step, re-building the corpus.
+To recover Reddit data, use the Python script `get_text.py`, which will restore the text in all top-level folders except for `PAULA` and `annis`. If you do not have credentials for the Python Reddit API wrapper (praw) and Google BigQuery, the script can attempt to download the data for you from a proxy. Otherwise, you can use your own credentials and place them in two files, `praw.txt` and `key.json`, next to `get_text.py`. For this to work, you must have the praw and bigquery libraries installed for Python (e.g. via pip).
-Alternatively, if you can't use praw/bigquery, the script `_build/process_reddit.py` will offer to download the data for you by proxy. To do this, run the script and confirm that you will only use the data according to the terms and conditions determined by reddit, and for non-commercial purposes. The script will then download the data for you - if the download is successful, you can continue to the next step and re-build the corpus.
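+
+For reference, `get_text.py` expects `praw.txt` to contain one tab-separated key/value pair per line (lines starting with `#` are ignored). The keys below are the ones the script looks for; the values are placeholders to replace with your own credentials:
+
+```
+client_id	YOUR_CLIENT_ID
+client_secret	YOUR_CLIENT_SECRET
+username	YOUR_REDDIT_USERNAME
+password	YOUR_REDDIT_PASSWORD
+user_agent	YOUR_USER_AGENT_STRING
+```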
+If you also require the `PAULA` and `annis` formats, you must rebuild the corpus from `_build/src/`. To do this, run `_build/process_reddit.py`, which again requires either running a proxy download or using your own credentials and placing them in `_build/utils/get_reddit/`. Once the download completes successfully, you will need to rebuild the corpus as explained in the next step.
-## Rebuilding the corpus with reddit data
+## Rebuilding the corpus with Reddit data
-To compile all projected annotations and produce all formats not included in `_build/src/`, you will need to run the GUM build bot: `python _build/build_gum.py`. This process is described in detail at https://gucorpling.org/gum/build.html, but summarized instructions follow.
+To compile all projected annotations and produce all formats not included in `_build/src/`, you will need to run the GUM build bot: `python build_gum.py` in `_build/`. This process is described in detail at https://gucorpling.org/gum/build.html, but summarized instructions follow.
-At a minumum, you can run `python _build/build_gum.py` with no options. This will produce basic formats in `_build/target/`, but skip generating fresh constituent parses, CLAWS5 tags and the Universal Dependencies version of the dependency data. To include these you will need:
+At a minimum, you can run `build_gum.py` with no options. This will produce basic formats in `_build/target/`, but skip generating fresh constituent parses and CLAWS5 tags. To include these you will need the following (see the example invocation after the list):
* CLAWS5: use option -c and ensure that utils/paths.py points to an executable for the TreeTagger (http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/). The CLAWS 5 parameter file is already included in utils/treetagger/lib/, and tags are auto-corrected by the build bot based on gold PTB tags.
  * Constituent parses: option -p; ensure that paths.py correctly points to your installation of the Stanford Parser/CoreNLP
- * Universal Dependencies: option -u; ensure the paths.py points to CoreNLP, and that you have installed udapi and depedit (pip install udapi; pip install depedit). Note that this only works with Python 3.
-If you run into problems building the corpus, feel free to report an issue via GitHub or contact us via e-mail.
\ No newline at end of file
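+
+For example, to produce both of the above in one run (a sketch; the corresponding tools must first be configured in `utils/paths.py`):
+
+```
+cd _build
+python build_gum.py -c -p
+```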
+After the build bot runs, data including `PAULA` and `annis` versions will be generated in the specified `target/` folder. If you run into problems building the corpus, feel free to report an issue via GitHub or contact us via e-mail.
\ No newline at end of file
diff --git a/_build/utils/get_reddit/underscores.py b/_build/utils/get_reddit/underscores.py
index cd6d25bdc..70e86ea40 100644
--- a/_build/utils/get_reddit/underscores.py
+++ b/_build/utils/get_reddit/underscores.py
@@ -1,4 +1,6 @@
import os, glob, re, io, sys
+from collections import defaultdict
+from copy import deepcopy
PY3 = sys.version_info[0] == 3
@@ -10,16 +12,21 @@ def deunderscoring(src_folder, textdic):
make_text_const(src_folder + "const" + os.sep, textdic)
-def make_text(folder, textdic, tok_col, lemma_col=None, unescape_xml=False):
+def make_text(folder, textdic, tok_col, lemma_col=None, unescape_xml=False, docs2lemmas=None, docs2tokens=None):
files_to_process = glob.glob(folder + "GUM_reddit*")
print("o Processing " + str(len(files_to_process)) + " files in " + folder + "...")
+ lemma_dict = defaultdict(list)
+ token_dict = defaultdict(list)
+ docs2tokens_copy = deepcopy(docs2tokens)
+ docs2lemmas_copy = deepcopy(docs2lemmas)
for f_path in files_to_process:
with io.open(f_path, 'r', encoding='utf-8') as fin:
in_lines = fin.read().replace("\r","").split("\n")
- tokens = textdic[os.path.basename(f_path)[:os.path.basename(f_path).find(".")]]
+ docname = os.path.basename(f_path)[:os.path.basename(f_path).find(".")]
+ tokens = textdic[docname]
if unescape_xml:
 				tokens = tokens.replace("&gt;",">").replace("&lt;","<").replace("&amp;","&")
else:
@@ -27,26 +34,45 @@ def make_text(folder, textdic, tok_col, lemma_col=None, unescape_xml=False):
 				tokens = tokens.replace("&","&amp;")
 				tokens = tokens.replace(">","&gt;").replace("<","&lt;")
if not PY3:
- tokens = tokens.decode("utf8")
+ tokens = tokens.decode("utf8")
+ text_tokens = list(tokens)
with io.open(f_path, 'w', encoding='utf-8', newline="\n") as fout:
for i, line in enumerate(in_lines):
if line.startswith('<'):
fout.write(line+"\n")
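+			# Sentence text metadata lines (#Text= in WebAnno TSV, # text = in CoNLL-U): rebuild the text character by character from the document's restored character stream, keeping whitespace as-is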
+			elif line.startswith("#") and ("Text=" in line or "text =" in line):
+ restored = [line.split("=",1)[0] + "="]
+ for c in line.split("=",1)[1]:
+ if c != " ":
+ restored.append(text_tokens.pop(0))
+ else:
+ restored.append(c)
+ fout.write("".join(restored)+"\n")
elif "\t" in line:
elements = line.split('\t')
- elements[tok_col] = tokens[:len(elements[tok_col])]
- tokens = tokens[len(elements[tok_col]):]
- #if not unescape_xml:
- # elements[tok_col] = elements[tok_col].replace("&","&").replace("&","&")
- if lemma_col is not None:
- if elements[lemma_col] == '_':
- if not (elements[tok_col] in ["hearing","hind"] and "_card" in f_path): # Check known goeswith cases
- elements[lemma_col] = elements[tok_col]
- else:
- elements[lemma_col] = "_"
- elif elements[lemma_col] == "*LOWER*":
- elements[lemma_col] = elements[tok_col].lower()
+ if not (len(elements) == 10 and len(elements[-1]) >0 and ("." in elements[0] or "-" in elements[0])):
+ elements[tok_col] = tokens[:len(elements[tok_col])]
+ token_dict[docname].append(elements[tok_col])
+ tokens = tokens[len(elements[tok_col]):]
+ #if not unescape_xml:
+ # elements[tok_col] = elements[tok_col].replace("&","&").replace("&","&")
+ if lemma_col is not None:
+ if elements[lemma_col] == '_':
+ if not (elements[tok_col] in ["hearing","hind"] and "_card" in f_path): # Check known goeswith cases
+ elements[lemma_col] = elements[tok_col]
+ else:
+ elements[lemma_col] = "_"
+ elif elements[lemma_col] == "*LOWER*":
+ elements[lemma_col] = elements[tok_col].lower()
+ lemma_dict[docname].append(elements[lemma_col])
+ if docs2lemmas is not None: # Reconstruct lemmas for conllu
+ if "." not in elements[0] and "-" not in elements[0]:
+ elements[2] = docs2lemmas_copy[docname].pop(0)
+ docs2tokens_copy[docname].pop(0)
+ elif "-" in elements[0]: # Conllu MWT
+ elements[1] = docs2tokens_copy[docname][0]
+ elements[1] += docs2tokens_copy[docname][1]
try:
fout.write('\t'.join(elements)+"\n")
except Exception as e:
@@ -58,10 +84,11 @@ def make_text(folder, textdic, tok_col, lemma_col=None, unescape_xml=False):
fout.write("\n")
else:
fout.write(unicode("\n"))
+ return lemma_dict, token_dict
-def make_text_rst(folder, textdic):
- files_to_process = glob.glob(folder + "GUM_reddit*.rs3")
+def make_text_rst(folder, textdic, unescape_xml=False, extension="rs3", edu_regex=r'(.*<segment[^>]*>)(.*)(</segment>)'):
+ files_to_process = glob.glob(folder + "GUM_reddit*." + extension)
print("o Processing " + str(len(files_to_process)) + " files in "+folder+"...")
# Delete tokens in .xml files
@@ -70,10 +97,12 @@ def make_text_rst(folder, textdic):
tokens = textdic[os.path.basename(f_path)[:os.path.basename(f_path).find(".")]]
if not PY3:
tokens = tokens.decode("utf8")
-		if "&" in tokens and not "&amp;" in tokens and not "_ring" in f_path: # Some bigquery entries have no &amp;
-			tokens = tokens.replace("&","&amp;")
-			tokens = tokens.replace(">","&gt;").replace("<","&lt;") # Reddit API does not escape lt/gt, but does escape &
-
+ if unescape_xml:
+			tokens = tokens.replace("&gt;",">").replace("&lt;","<").replace("&amp;","&")
+ else:
+			if "&" in tokens and not "&amp;" in tokens and not "_ring" in f_path: # Some bigquery entries have no &amp;
+				tokens = tokens.replace("&", "&amp;")
+			tokens = tokens.replace(">", "&gt;").replace("<","&lt;") # Reddit API does not escape lt/gt, but does escape &
with io.open(f_path, 'r', encoding='utf-8') as fin:
in_lines = fin.read().replace("\r","").split("\n")
@@ -81,10 +110,10 @@ def make_text_rst(folder, textdic):
with io.open(f_path, 'w', encoding='utf-8', newline="\n") as fout:
cursor = 0
for i, line in enumerate(in_lines):
-				if "<segment" in line and "</segment>" in line:
-					m = re.search(r'(.*<segment[^>]*>)(.*)(</segment>)',line)
+ m = re.search(edu_regex,line)
pre = m.group(1)
seg = m.group(2)
post = m.group(3)
@@ -113,8 +142,8 @@ def underscoring(src_folder):
make_underscores_const(src_folder + "const" + os.sep)
-def make_underscores_rst(folder):
- files_to_process = glob.glob(folder + "GUM_reddit*.rs3")
+def make_underscores_rst(folder, extension="rs3", edu_regex=r'(.*<segment[^>]*>)(.*)(</segment>)'):
+ files_to_process = glob.glob(folder + "GUM_reddit*." + extension)
print("o Processing " + str(len(files_to_process)) + " files in "+folder+"...")
# Delete tokens in .xml files
@@ -125,10 +154,10 @@ def make_underscores_rst(folder):
with io.open(f_path, 'w', encoding='utf-8', newline="\n") as fout:
for i, line in enumerate(in_lines):
-			if "<segment" in line and "</segment>" in line:
-				m = re.search(r'(.*<segment[^>]*>)(.*)(</segment>)',line)
+ m = re.search(edu_regex,line)
pre = m.group(1)
seg = m.group(2)
post = m.group(3)
@@ -156,11 +185,12 @@ def make_underscores(folder, tok_col, lemma_col=None):
for i, line in enumerate(in_lines):
if line.startswith('<'):
fout.write(line + "\n")
- elif line.startswith("#Text="):
+ elif line.startswith("#Text=") or line.startswith("# text ="):
+ underscored_text = line.split("=",1)[0] + "=" + re.sub(r'[^\s]','_',line.split("=",1)[1])
if PY3:
- fout.write("#Text=_" + "\n")
+ fout.write(underscored_text + "\n")
else:
- fout.write(unicode("#Text=_" + "\n"))
+ fout.write(unicode(underscored_text + "\n"))
elif "\t" in line:
#line = line.replace("&","&")
elements = line.split('\t')
diff --git a/_build/utils/get_reddit/underscores_disrpt.py b/_build/utils/get_reddit/underscores_disrpt.py
new file mode 100644
index 000000000..7d1a82016
--- /dev/null
+++ b/_build/utils/get_reddit/underscores_disrpt.py
@@ -0,0 +1,264 @@
+"""
+underscores_disrpt.py
+
+Script to handle data for which the underlying text cannot be posted online (originally written for LDC-licensed corpora);
+used here to underscore and restore the Reddit portion of GUM in the DISRPT shared task formats.
+
+
+"""
+
+__author__ = "Amir Zeldes"
+__license__ = "Apache 2.0"
+__version__ = "2.0.0"
+
+import io, re, os, sys
+from glob import glob
+from collections import defaultdict
+script_dir = os.path.dirname(os.path.realpath(__file__)) + os.sep
+
+PY3 = sys.version_info[0] == 3
+
+
+def underscore_files(filenames):
+ def underscore_rel_field(text):
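+		# Blank out every character of a DISRPT .rels text field except whitespace and the <*> separator placeholder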
+ blanked = []
+ text = text.replace("<*>","❤")
+ for c in text:
+ if c!="❤" and c!=" ":
+ blanked.append("_")
+ else:
+ blanked.append(c)
+ return "".join(blanked).replace("❤","<*>")
+
+ if isinstance(filenames,str):
+ filenames = glob(filenames + "*.*")
+ for f_path in filenames:
+ skiplen = 0
+ with io.open(f_path, 'r', encoding='utf8') as fin:
+ lines = fin.readlines()
+
+ with io.open(f_path, 'w', encoding='utf8', newline="\n") as fout:
+ output = []
+ if f_path.endswith(".rels"):
+ for l, line in enumerate(lines):
+ line = line.strip()
+ if "\t" in line and l > 0:
+ doc, unit1_toks, unit2_toks, unit1_txt, unit2_txt, s1_toks, s2_toks, unit1_sent, unit2_sent, direction, orig_label, label = line.split("\t")
+ if "GUM" in doc and "reddit" not in doc:
+ output.append(line)
+ continue
+ unit1_txt = underscore_rel_field(unit1_txt)
+ unit2_txt = underscore_rel_field(unit2_txt)
+ unit1_sent = underscore_rel_field(unit1_sent)
+ unit2_sent = underscore_rel_field(unit2_sent)
+ fields = doc, unit1_toks, unit2_toks, unit1_txt, unit2_txt, s1_toks, s2_toks, unit1_sent, unit2_sent, direction, orig_label, label
+ line = "\t".join(fields)
+ output.append(line)
+ else:
+ doc = ""
+ for line in lines:
+ line = line.strip()
+ if line.startswith("# newdoc id"):
+ doc = line.split("=",maxsplit=1)[1].strip()
+ if "GUM" in doc and "reddit" not in doc:
+ output.append(line)
+ continue
+ if line.startswith("# text"):
+ m = re.match(r'(# text ?= ?)(.+)',line)
+ if m is not None:
+ line = m.group(1) + re.sub(r'[^\s]','_',m.group(2))
+ output.append(line)
+ elif "\t" in line:
+ fields = line.split("\t")
+ tok_col, lemma_col = fields[1:3]
+ if lemma_col == tok_col: # Delete lemma if identical to token
+ fields[2] = '_'
+ elif tok_col.lower() == lemma_col:
+ fields[2] = "*LOWER*"
+ if skiplen < 1:
+ fields[1] = len(tok_col)*'_'
+ else:
+ skiplen -=1
+ output.append("\t".join(fields))
+ if "-" in fields[0]: # Multitoken
+ start, end = fields[0].split("-")
+ start = int(start)
+ end = int(end)
+ skiplen = end - start + 1
+ else:
+ output.append(line)
+ fout.write('\n'.join(output) + "\n")
+
+
+def restore_docs(path_to_underscores,text_dict):
+ def restore_range(range_string, underscored, tid_dict):
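+		# Map a token span specification like "5-7,9" back to the underlying tokens via tid_dict, leaving <*> separators in place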
+ output = []
+ tok_ids = []
+ range_strings = range_string.split(",")
+ for r in range_strings:
+ if "-" in r:
+ s, e = r.split("-")
+ tok_ids += list(range(int(s),int(e)+1))
+ else:
+ tok_ids.append(int(r))
+
+ for tok in underscored.split():
+ if tok == "<*>":
+ output.append(tok)
+ else:
+ tid = tok_ids.pop(0)
+ output.append(tid_dict[tid])
+ return " ".join(output)
+
+ dep_files = glob(path_to_underscores+os.sep+"*.conllu")
+ tok_files = glob(path_to_underscores+os.sep+"*.tok")
+ rel_files = glob(path_to_underscores+os.sep+"*.rels")
+ skiplen = 0
+ token_dict = {}
+ tid2string = defaultdict(dict)
+ for file_ in dep_files + tok_files + rel_files:
+ lines = io.open(file_,encoding="utf8").readlines()
+ underscore_len = 0 # Must match doc_len at end of file processing
+ doc_len = 0
+ if file_.endswith(".rels"):
+ output = []
+ violation_rows = []
+ for l, line in enumerate(lines):
+ line = line.strip()
+ if l > 0 and "\t" in line:
+ fields = line.split("\t")
+ docname = fields[0]
+ text = text_dict[docname]
+ if "GUM_" in docname and "reddit" not in docname: # Only Reddit documents need reconstruction in GUM
+ output.append(line)
+ continue
+ doc, unit1_toks, unit2_toks, unit1_txt, unit2_txt, s1_toks, s2_toks, unit1_sent, unit2_sent, direction, orig_label, label = line.split("\t")
+ underscore_len += unit1_txt.count("_") + unit2_txt.count("_") + unit1_sent.count("_") + unit2_sent.count("_")
+ if underscore_len == 0:
+ #sys.stderr.write("! Non-underscored file detected - " + os.path.basename(file_) + "\n")
+						print("! DISRPT format already restored in " + os.path.basename(file_) + "\n")
+ sys.exit(0)
+ unit1_txt = restore_range(unit1_toks, unit1_txt, tid2string[docname])
+ unit2_txt = restore_range(unit2_toks, unit2_txt, tid2string[docname])
+ unit1_sent = restore_range(s1_toks, unit1_sent, tid2string[docname])
+ unit2_sent = restore_range(s2_toks, unit2_sent, tid2string[docname])
+ plain = unit1_txt + unit2_txt + unit1_sent + unit2_sent
+ plain = plain.replace("<*>","").replace(" ","")
+ doc_len += len(plain)
+ fields = doc, unit1_toks, unit2_toks, unit1_txt, unit2_txt, s1_toks, s2_toks, unit1_sent, unit2_sent, direction, orig_label, label
+ line = "\t".join(fields)
+ if doc_len != underscore_len and len(violation_rows) == 0:
+ violation_rows.append(str(l) + ": " + line)
+ output.append(line)
+
+ else:
+ tokfile = True if ".tok" in file_ else False
+ output = []
+ parse_text = ""
+ docname = ""
+ for line in lines:
+ line = line.strip()
+ if "# newdoc id " in line:
+ tid = 0
+ if parse_text !="":
+ if not tokfile:
+ token_dict[docname] = parse_text
+ parse_text = ""
+ docname = re.search(r'# newdoc id ?= ?([^\s]+)',line).group(1)
+ if "GUM" in docname and "reddit" not in docname:
+ output.append(line)
+ continue
+ if docname not in text_dict:
+					raise IOError("! Text for document name " + docname + " not found.\n Please check that your Reddit text data contains this document.\n")
+ if ".tok" in file_:
+ if docname not in token_dict: # Fetch continuous token string from conllu
+ parse_conllu = open(os.sep.join([script_dir,"..","..","..","dep",docname + ".conllu"])).read()
+ toks = [l.split("\t") for l in parse_conllu.split("\n") if "\t" in l]
+ toks = [l[1] for l in toks if "-" not in l[0] and "." not in l[0]]
+ toks = "".join(toks)
+ token_dict[docname] = toks
+ text = token_dict[docname]
+ else:
+ text = text_dict[docname]
+ doc_len = len(text)
+ underscore_len = 0
+
+ if "GUM" in docname and "reddit" not in docname:
+ output.append(line)
+ continue
+
+ if line.startswith("# text"):
+ m = re.match(r'(# ?text ?= ?)(.+)',line)
+ if m is not None:
+ i = 0
+ sent_text = ""
+ for char in m.group(2).strip():
+ if char != " ":
+ try:
+ sent_text += text[i]
+ except:
+ raise IOError("Can't fix")
+ i+=1
+ else:
+ sent_text += " "
+ line = m.group(1) + sent_text
+ output.append(line)
+ elif "\t" in line:
+ fields = line.split("\t")
+ if skiplen < 1:
+ underscore_len += len(fields[1])
+ fields[1] = text[:len(fields[1])]
+ if not "-" in fields[0] and not "." in fields[0]:
+ parse_text += fields[1]
+ tid += 1
+ tid2string[docname][tid] = fields[1]
+ if not tokfile:
+ if fields[2] == '_' and not "-" in fields[0] and not "." in fields[0]:
+ fields[2] = fields[1]
+ elif fields[2] == "*LOWER*":
+ fields[2] = fields[1].lower()
+ if skiplen < 1:
+ text = text[len(fields[1]):]
+ else:
+ skiplen -=1
+ output.append("\t".join(fields))
+ if "-" in fields[0]: # Multitoken
+ start, end = fields[0].split("-")
+ start = int(start)
+ end = int(end)
+ skiplen = end - start + 1
+ else:
+ output.append(line)
+
+ if not doc_len == underscore_len:
+ if ".rels" in file_:
+ sys.stderr.write(
+ "\n! Tried to restore file " + os.path.basename(file_) + " but source text has different length than tokens in shared task file:\n" + \
+ " Source text in data/: " + str(doc_len) + " non-whitespace characters\n" + \
+ " Token underscores in " + file_ + ": " + str(underscore_len) + " non-whitespace characters\n" + \
+ " Violation row: " + violation_rows[0])
+ else:
+ sys.stderr.write("\n! Tried to restore document " + docname + " but source text has different length than tokens in shared task file:\n" + \
+ " Source text in data/: " + str(doc_len) + " non-whitespace characters\n" + \
+ " Token underscores in " + file_+": " + str(underscore_len) + " non-whitespace characters\n")
+ with io.open("debug.txt",'w',encoding="utf8") as f:
+ f.write(text_dict[docname])
+ f.write("\n\n\n")
+ f.write(parse_text)
+ sys.exit(0)
+
+ if not tokfile and parse_text != "":
+ token_dict[docname] = parse_text
+
+ with io.open(file_, 'w', encoding='utf8', newline="\n") as fout:
+ fout.write("\n".join(output) + "\n")
+
+ print("o Restored text for DISRPT format in " + \
+ #str(len(dep_files)) + " .conllu files, " + \
+ str(len(tok_files)) + " .tok files and "+ str(len(rel_files)) + " .rels files\n")
+
+
+
+
+
+
diff --git a/annis/README.md b/annis/README.md
new file mode 100644
index 000000000..f6db682ad
--- /dev/null
+++ b/annis/README.md
@@ -0,0 +1,3 @@
+# GUM corpus - ANNIS version (without Reddit)
+
+The zip file in this directory can be imported for search and visualization into [ANNIS](https://corpus-tools.org/annis/). Note that this version of the corpus does not include the Reddit subcorpus of GUM. To compile an ANNIS version of the corpus including the Reddit subcorpus, please see [README_reddit.md](https://github.com/amir-zeldes/gum/blob/master/README_reddit.md).
\ No newline at end of file
diff --git a/get_text.py b/get_text.py
new file mode 100644
index 000000000..8919c1d1c
--- /dev/null
+++ b/get_text.py
@@ -0,0 +1,631 @@
+import ast, re, io, os, sys
+import requests
+from argparse import ArgumentParser
+from collections import defaultdict
+from glob import glob
+from requests.exceptions import ConnectionError
+from _build.utils.get_reddit.underscores import make_text, make_text_const, make_text_rst, make_underscores, make_underscores_rst, make_underscores_const
+from _build.utils.get_reddit.underscores_disrpt import underscore_files as underscore_disrpt, restore_docs as restore_disrpt
+
+PY3 = sys.version_info[0] == 3
+script_dir = os.path.dirname(os.path.realpath(__file__)) + os.sep
+
+if not PY3:
+ reload(sys)
+ sys.setdefaultencoding('utf8')
+
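+# Reddit posts and comments underlying each GUM_reddit_* document: posting year and month, Reddit ID, unit type (post or comment) and preferred retrieval source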
+docs = {
+ "GUM_reddit_macroeconomics": [
+ {"year": "2017", "month": "09", "id": "6zm74h", "type": "post","source":"undef"},
+ {"year": "2017", "month": "09", "id": "dmwwqlt", "type":"comment","source":"undef"}
+ ],
+ "GUM_reddit_stroke": [
+ {"year": "2017", "month": "08", "id": "6ws3eh", "type": "post","source":"undef"},
+ {"year": "2017", "month": "08", "id": "dmaei1x", "type":"comment","source":"undef"},
+ {"year": "2017", "month": "08", "id": "dmaiwsm", "type":"comment","source":"undef"},
+ {"year": "2017", "month": "09", "id": "dmkx8bk", "type":"comment","source":"undef"},
+ {"year": "2017", "month": "09", "id": "dmm1327", "type":"comment","source":"undef"},
+ {"year": "2017", "month": "08", "id": "dmaoodn", "type":"comment","source":"undef"}
+ ],
+ "GUM_reddit_polygraph": [
+ {"year": "2014", "month": "12", "id": "2q6qnv", "type": "post","source":"undef"}
+ ],
+ "GUM_reddit_ring": [
+ {"year": "2016", "month": "09", "id": "5570x1", "type": "post","source":"undef"},
+ {"year": "2016", "month": "09", "id": "d885ma0", "type":"comment","source":"undef"},
+ {"year": "2016", "month": "09", "id": "d8880w7", "type":"comment","source":"undef"},
+ {"year": "2016", "month": "09", "id": "d88u7dg", "type":"comment","source":"undef"},
+ {"year": "2016", "month": "09", "id": "d88unu3", "type":"comment","source":"undef"},
+ {"year": "2016", "month": "09", "id": "d88v0sz", "type":"comment","source":"undef"},
+ {"year": "2016", "month": "09", "id": "d88xaqu", "type":"comment","source":"undef"},
+ {"year": "2016", "month": "10", "id": "d893mj9", "type":"comment","source":"undef"},
+ {"year": "2016", "month": "09", "id": "d88s4bb", "type":"comment","source":"undef"},
+ {"year": "2016", "month": "10", "id": "d88zt6x", "type":"comment","source":"undef"}
+ ],
+ "GUM_reddit_space": [
+ {"year": "2016", "month": "08", "id": "50hx5c", "type": "post","source":"undef"},
+ {"year": "2016", "month": "08", "id": "d7471k5", "type":"comment","source":"undef"},
+ {"year": "2016", "month": "08", "id": "d74i5ka", "type":"comment","source":"undef"},
+ {"year": "2016", "month": "08", "id": "d74ppi0", "type":"comment","source":"undef"}
+ ],
+ "GUM_reddit_superman": [
+ #{"year": "2017", "month": "04", "id": "68e0u3", "type": "post", "title_only": True}, # Post title not included in this document
+ {"year": "2017", "month": "05", "id": "dgys1z8", "type":"comment","source":"undef"}
+ ],
+ "GUM_reddit_bobby": [
+ {"year":"2018","month":"06","id":"8ph56q","type": "post","source":"undef"},
+ {"year":"2018","month":"06","id":"e0b8zz4","type":"comment","source":"undef"},
+ {"year":"2018","month":"06","id":"e0dwqlg","type":"comment","source":"undef"},
+ {"year":"2018","month":"06","id":"e15pcqu","type":"comment","source":"undef"},
+ {"year":"2018","month":"06","id":"e0dz1mp","type":"comment","source":"undef"},
+ {"year":"2018","month":"06","id":"e1uuo9e","type":"comment","source":"undef"},
+ {"year":"2018","month":"06","id":"e0brc9w","type":"comment","source":"undef"},
+ {"year":"2018","month":"06","id":"e0bz951","type":"comment","source":"undef"}
+ ],
+ "GUM_reddit_escape": [
+ {"year":"2017","month":"05","id":"69r98j","type": "post","source":"undef"},
+ {"year":"2017","month":"05","id":"dh96n8v","type":"comment","source":"undef"},
+ {"year":"2017","month":"05","id":"dh9enpe","type":"comment","source":"undef"},
+ {"year":"2017","month":"05","id":"dht8oyn","type":"comment","source":"undef"},
+ {"year":"2017","month":"05","id":"dhn0hoe","type":"comment","source":"undef"},
+ {"year":"2017","month":"07","id":"dk9ted1","type":"comment","source":"undef"},
+ {"year":"2017","month":"05","id":"dh98kcg","type":"comment","source":"undef"},
+ {"year":"2017","month":"05","id":"dh9zxej","type":"comment","source":"undef"},
+ {"year":"2017","month":"05","id":"di9x7j9","type":"comment","source":"undef"},
+ {"year":"2017","month":"05","id":"di9xsrt","type":"comment","source":"undef"},
+ {"year":"2017","month":"06","id":"din85zf","type":"comment","source":"undef"},
+ {"year":"2017","month":"06","id":"dinab0w","type":"comment","source":"undef"},
+ {"year":"2017","month":"06","id":"dinaggd","type":"comment","source":"undef"},
+ {"year":"2017","month":"06","id":"dinbyb9","type":"comment","source":"undef"},
+ {"year":"2017","month":"06","id":"dj65sp1","type":"comment","source":"undef"},
+ {"year":"2017","month":"06","id":"dizdd8a","type":"comment","source":"undef"},
+ {"year":"2017","month":"07","id":"dk78qw8","type":"comment","source":"undef"},
+ {"year":"2017","month":"08","id":"dm0gqc7","type":"comment","source":"undef"},
+ {"year":"2017","month":"10","id":"domd1r0","type":"comment","source":"undef"},
+ {"year":"2017","month":"05","id":"dh9irie","type":"comment","source":"undef"},
+ {"year":"2017","month":"05","id":"dh9iw36","type":"comment","source":"undef"},
+ {"year":"2017","month":"06","id":"djlcwu5","type":"comment","source":"undef"},
+ {"year":"2017","month":"06","id":"dlzcxpy","type":"comment","source":"undef"},
+ {"year":"2017","month":"05","id":"dhabstb","type":"comment","source":"undef"},
+ {"year":"2017","month":"05","id":"dhbr3m6","type":"comment","source":"undef"},
+		{"year":"2017","month":"06","id":"diz97qy","type":"comment","source":"undef"}
+ ],
+ "GUM_reddit_gender": [
+ {"year":"2018","month":"09","id":"9e5urs","type":"post","source":"bigquery"},
+ {"year":"2018","month":"09","id":"e5mg3s7","type":"comment","source":"undef"},
+ {"year":"2018","month":"09","id":"e5mkpok","type":"comment","source":"bigquery"},
+ {"year":"2018","month":"09","id":"e5nxbmb","type":"comment","source":"bigquery"},
+ {"year":"2018","month":"09","id":"e5nzg9j","type":"comment","source":"undef"},
+ {"year":"2018","month":"09","id":"e5mh94v","type":"comment","source":"undef"},
+ {"year":"2018","month":"09","id":"e5mmenp","type":"comment","source":"undef"},
+ {"year":"2018","month":"09","id":"e5ms5u3","type":"comment","source":"undef"}
+ ],
+ "GUM_reddit_monsters":[
+ {"year":"2018","month":"09","id":"9eci2u","type":"post","source":"undef"},
+ {"year":"2018","month":"09","id":"e5ox2jr","type":"comment","source":"undef"},
+ {"year":"2018","month":"09","id":"e5p3gtl","type":"comment","source":"undef"},
+ {"year":"2018","month":"09","id":"e5pnfro","type":"comment","source":"undef"},
+ {"year":"2018","month":"09","id":"e5q08o4","type":"comment","source":"undef"},
+ {"year":"2018","month":"09","id":"e5pney1","type":"comment","source":"undef"},
+ ],
+ "GUM_reddit_pandas":[
+ {"year":"2018","month":"09","id":"9e3s9h","type":"post","source":"undef"},
+ {"year":"2018","month":"09","id":"e5lwy6n","type":"comment","source":"undef"},
+ {"year":"2018","month":"09","id":"e5m397o","type":"comment","source":"undef"},
+ {"year":"2018","month":"09","id":"e5m3xgb","type":"comment","source":"undef"},
+ {"year":"2018","month":"09","id":"e5m3z2e","type":"comment","source":"undef"},
+ {"year":"2018","month":"09","id":"e5lwbbt","type":"comment","source":"undef"},
+ {"year":"2018","month":"09","id":"e5m38sr","type":"comment","source":"undef"},
+ {"year":"2018","month":"09","id":"e5m42cu","type":"comment","source":"undef"},
+ {"year":"2018","month":"09","id":"e5lvlxm","type":"comment","source":"undef"},
+ {"year":"2018","month":"09","id":"e5lvqay","type":"comment","source":"undef"},
+ {"year":"2018","month":"09","id":"e5lw5t6","type":"comment","source":"undef"}, # Blowhole
+ {"year":"2018","month":"09","id":"e5lwz31","type":"comment","source":"undef"},
+ {"year":"2018","month":"09","id":"e5lxi0s","type":"comment","source":"undef"},
+ {"year":"2018","month":"09","id":"e5lwxqq","type":"comment","source":"undef"},
+ {"year":"2018","month":"09","id":"e5lzv1b","type":"comment","source":"undef"},
+ {"year":"2018","month":"09","id":"e5m48ag","type":"comment","source":"undef"},
+ {"year":"2018","month":"09","id":"e5m1yqe","type":"comment","source":"undef"},
+ {"year":"2018","month":"09","id":"e5lx0sw","type":"comment","source":"undef"},
+ {"year":"2018","month":"09","id":"e5m2n80","type":"comment","source":"undef"},
+ {"year":"2018","month":"09","id":"e5m2wrh","type":"comment","source":"undef"},
+ {"year":"2018","month":"09","id":"e5m3blb","type":"comment","source":"undef"},
+ {"year":"2018","month":"09","id":"e5lvxoc","type":"comment","source":"undef"},
+ {"year":"2018","month":"09","id":"e5m1abg","type":"comment","source":"undef"},
+ {"year":"2018","month":"09","id":"e5m1w5i","type":"comment","source":"undef"},
+ {"year":"2018","month":"09","id":"e5m3pdi","type":"comment","source":"undef"},
+ {"year":"2018","month":"09","id":"e5m3ruf","type":"comment","source":"undef"},
+ {"year":"2018","month":"09","id":"e5m4yu2","type":"comment","source":"undef"},
+ {"year":"2018","month":"09","id":"e5m5bcb","type":"comment","source":"undef"}
+ ],
+ "GUM_reddit_steak": [
+ {"year":"2015","month":"08","id":"3im341","type":"post","source":"undef"}
+ ],
+ "GUM_reddit_card": [
+ {"year":"2019","month":"08","id":"cmqrwo","type":"post","source":"undef"},
+ {"year":"2019","month":"08","id":"ew3zrqg","type":"comment","source":"undef"},
+ {"year":"2019","month":"08","id":"ew43d2c","type":"comment","source":"undef"},
+ {"year":"2019","month":"08","id":"ew43oks","type":"comment","source":"undef"},
+ {"year":"2019","month":"08","id":"ew43ymc","type":"comment","source":"undef"},
+ {"year":"2019","month":"08","id":"ew46h1p","type":"comment","source":"undef"},
+ {"year":"2019","month":"08","id":"ew46oly","type":"comment","source":"undef"},
+ {"year":"2019","month":"08","id":"ew46wq7","type":"comment","source":"undef"},
+ {"year":"2019","month":"08","id":"ew470zc","type":"comment","source":"undef"}
+ ],
+ "GUM_reddit_callout": [
+ {"year":"2019","month":"09","id":"d1eg3u","type":"post","source":"undef"},
+ {"year":"2019","month":"09","id":"ezkucpg","type":"comment","source":"undef"},
+ {"year":"2019","month":"09","id":"ezkv0cc","type":"comment","source":"undef"},
+ {"year":"2019","month":"09","id":"ezkwbx9","type":"comment","source":"undef"},
+ {"year":"2019","month":"09","id":"ezlh2o6","type":"comment","source":"undef"},
+ {"year":"2019","month":"09","id":"ezlkajf","type":"comment","source":"undef"},
+ {"year":"2019","month":"09","id":"ezlnco2","type":"comment","source":"undef"},
+ {"year":"2019","month":"09","id":"ezo20yy","type":"comment","source":"undef"},
+ {"year":"2019","month":"09","id":"ezkwcvh","type":"comment","source":"undef"},
+ {"year":"2019","month":"09","id":"ezl07dm","type":"comment","source":"undef"},
+ {"year":"2019","month":"09","id":"ezmajm7","type":"comment","source":"undef"},
+ {"year":"2019","month":"09","id":"ezl1wz3","type":"comment","source":"undef"},
+ ],
+ "GUM_reddit_conspiracy": [
+ {"year":"2019","month":"02","id":"aumhwo","type":"post","source":"undef"},
+ {"year":"2019","month":"02","id":"eh9rt0n","type":"comment","source":"undef"},
+ {"year":"2019","month":"02","id":"eh9tvyw","type":"comment","source":"undef"},
+ {"year":"2019","month":"02","id":"ehc0l2q","type":"comment","source":"undef"},
+ {"year":"2019","month":"02","id":"ehclwtv","type":"comment","source":"undef"},
+ {"year":"2019","month":"02","id":"eh9jo5x","type":"comment","source":"undef"},
+ {"year":"2019","month":"02","id":"ehr2665","type":"comment","source":"undef"},
+ {"year":"2019","month":"02","id":"eha3c1q","type":"comment","source":"undef"},
+ {"year":"2019","month":"02","id":"eha5jlq","type":"comment","source":"undef"},
+ ],
+ "GUM_reddit_introverts": [
+ {"year":"2019","month":"06","id":"by820m","type":"post","source":"undef","title_double": True}, # Possible title was repeated by annotator
+ {"year":"2019","month":"06","id":"eqeik8m","type":"comment","source":"undef"},
+ {"year":"2019","month":"06","id":"eqfgaeu","type":"comment","source":"undef"},
+ {"year":"2019","month":"06","id":"eqfplpg","type":"comment","source":"undef"},
+ {"year":"2019","month":"06","id":"eqg6a5u","type":"comment","source":"undef"},
+ {"year":"2019","month":"06","id":"eqh6j29","type":"comment","source":"undef"},
+ {"year":"2019","month":"06","id":"eqhjtwr","type":"comment","source":"undef"},
+ {"year":"2019","month":"06","id":"eqi2jl3","type":"comment","source":"undef"},
+ {"year":"2019","month":"06","id":"eqii2kf","type":"comment","source":"undef"},
+ {"year":"2019","month":"06","id":"eqhlj8j","type":"comment","source":"undef"},
+
+ ],
+ "GUM_reddit_racial": [
+ {"year":"2019","month":"09","id":"d1urjk","type":"post","source":"undef"},
+ {"year":"2019","month":"09","id":"ezq9y6w","type":"comment","source":"bigquery"},
+ {"year":"2019","month":"09","id":"ezqpqmm","type":"comment","source":"undef"},
+ {"year":"2019","month":"09","id":"ezq8xs7","type":"comment","source":"undef"},
+ {"year":"2019","month":"09","id":"ezr55wk","type":"comment","source":"undef"},
+ ],
+ "GUM_reddit_social": [
+ {"year":"2019","month":"09","id":"d1qy3g","type":"post","source":"undef"},
+ {"year":"2019","month":"09","id":"ezpb3jg","type":"comment","source":"undef"},
+ {"year":"2019","month":"09","id":"ezpdmy3","type":"comment","source":"undef"},
+ {"year":"2019","month":"09","id":"ezpjor8","type":"comment","source":"bigquery"},
+ {"year":"2019","month":"09","id":"ezpiozm","type":"comment","source":"undef"},
+ {"year":"2019","month":"09","id":"ezpc1ps","type":"comment","source":"undef"},
+ {"year":"2019","month":"09","id":"ezp9fbh","type":"comment","source":"undef"},
+ {"year":"2019","month":"09","id":"ezqrumb","type":"comment","source":"undef"},
+ {"year":"2019","month":"09","id":"ezpe0e6","type":"comment","source":"undef"},
+ {"year":"2019","month":"09","id":"ezpf71f","type":"comment","source":"undef"},
+ {"year":"2019","month":"09","id":"ezt7qlf","type":"comment","source":"undef"},
+ {"year":"2019","month":"09","id":"ezpc4jj","type":"comment","source":"undef"},
+ {"year":"2019","month":"09","id":"ezpa2e4","type":"comment","source":"undef"},
+ {"year":"2019","month":"09","id":"ezpfzql","type":"comment","source":"undef"},
+ {"year":"2019","month":"09","id":"ezpi39v","type":"comment","source":"undef"},
+ ]
+}
+
+
+def get_proxy_data():
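+	# Fetch tab-delimited "post_id<TAB>text" lines from a proxy server and return them as a dict keyed by post ID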
+ out_posts = {}
+ try:
+ # Try fetching from corpling server
+ raise ConnectionError
+ tab_delim = requests.get("https://corpling.uis.georgetown.edu/gum/fetch_text_proxy.py").text
+ except ConnectionError:
+ # Fall back to mirror on coptic-dictionary.org
+ tab_delim = requests.get("https://coptic-dictionary.org/gum/fetch_text_proxy.py").text
+ for line in tab_delim.split("\n"):
+ if "\t" in line:
+ post, text = line.split("\t")
+ out_posts[post] = text
+ return out_posts
+
+
+def get_via_praw(post_id, post_type,praw_cred):
+
+ if praw_cred is None:
+ raise IOError("Missing praw credentials")
+
+ from praw import Reddit
+
+ reddit = Reddit(client_id=praw_cred["client_id"], client_secret=praw_cred["client_secret"],
+ password=praw_cred["password"], user_agent=praw_cred["user_agent"],username=praw_cred["username"])
+
+ if post_type == "post":
+ submission = reddit.submission(post_id)
+ created_utc = submission.mod.thing.created_utc
+ selftext = submission.mod.thing.selftext
+ selftext = re.sub(r'\s+',' ',selftext)
+ selftext = selftext.replace("'","\\'")
+ title = submission.mod.thing.title
+ title = title.replace("'","\\'")
+ out_json = "[{'id':'"+post_id+"','selftext':'"+selftext+"','created_utc':"+str(int(created_utc))+",'title':'"+title+"'}]"
+ else:
+ submission = reddit.comment(post_id)
+ created_utc = submission.mod.thing.created_utc
+ selftext = submission.mod.thing.body
+ selftext = re.sub(r'\s+',' ',selftext)
+ selftext = selftext.replace("'","\\'")
+ title = ""
+ out_json = "[{'id':'"+post_id+"','body':'"+selftext+"','created_utc':"+str(int(created_utc))+"}]"
+
+ return out_json
+
+
+def get_post(year, month, post_id, post_type):
+	from bigquery import get_client
+	from time import sleep  # needed for the sleep() call after submitting the query
+
+ # JSON key provided by Google
+ json_key = os.path.dirname(os.path.realpath(__file__)) + os.sep + 'key.json'
+
+ client = get_client(json_key_file=json_key, readonly=True)
+
+ if post_type == "post":
+ post_or_comment = "posts"
+ else:
+ post_or_comment = "comments"
+ table_name = "fh-bigquery.reddit_"+post_or_comment+"."+year+"_"+month
+
+ # Submit an async query.
+ query = "SELECT * FROM [" + table_name + "] WHERE id = '"+post_id+"';"
+ job_id, _results = client.query(query)
+
+ sleep(3)
+
+ # Check if the query has finished running.
+ complete, row_count = client.check_job(job_id)
+
+ # Retrieve the results.
+ results = client.get_query_rows(job_id)
+
+ return str(results)
+
+
+def get_no_space_strings(cache_dict, praw_cred=None, overwrite_cache=False):
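+	# Concatenate each document's post/comment texts (plus the initial post title) with all whitespace removed; the make_text routines later consume these strings token by token, by character length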
+
+ no_space_docs = defaultdict(str)
+
+ for doc in docs:
+ for post in docs[doc]:
+ if post["id"] in cache_dict:
+ json_result = cache_dict[post["id"]]
+ if overwrite_cache:
+ with io.open(os.path.dirname(os.path.realpath(__file__)) + os.sep + "cache.txt", "a", encoding="utf8") as f:
+ f.write(post["id"] + "\t" + json_result.strip() + "\n")
+ else:
+			if (int(post["year"]) > 2015 and int(post["year"]) < 2017) or (int(post["year"]) == 2015 and post["month"] == "12") or post["source"] == "bigquery":  # Available from bigquery
+ json_result = get_post(post["year"],post["month"],post["id"],post["type"])
+ else:
+ json_result = get_via_praw(post["id"],post["type"],praw_cred)
+ with io.open(os.path.dirname(os.path.realpath(__file__)) + os.sep + "cache.txt","a",encoding="utf8") as f:
+ f.write(post["id"] + "\t" + json_result.strip() + "\n")
+ parsed = ast.literal_eval(json_result)[0]
+ if post["type"]=="post":
+ plain = parsed["selftext"]
+ title = parsed["title"]
+ if "title_only" in post:
+ if post["title_only"]:
+ plain = ""
+ if "title_double" in post:
+ title = title + " " + title
+ else:
+ plain = parsed["body"]
+ title = ""
+ if "_space" in doc:
+ plain = plain.replace(">","") # GUM_reddit_space has formatting > to indicate indented block quotes
+ elif "_gender" in doc:
+ plain = plain.replace("- The vast","The vast")
+ plain = plain.replace("- Society already accommodates","Society already accommodates")
+ plain = plain.replace("- Society recognizes disabilities","Society recognizes disabilities")
+ plain = plain.replace("- It’s a waste of time","It’s a waste of time")
+				plain = plain.replace("PB&J","PB&amp;J")
+ elif "_monsters" in doc:
+ plain = plain.replace("1. He refers to","a. He refers to")
+ plain = plain.replace("2. Using these","b. Using these")
+ plain = plain.replace("3. And he has","c. And he has")
+ plain = plain.replace(" ","")
+ plain = re.sub(r' [0-9]+\. ',' ',plain)
+ elif "_ring" in doc:
+				plain = plain.replace(">","&gt;")
+ elif "_escape" in doc:
+ plain = plain.replace("*1 year later*","1 year later")
+ elif "_racial" in doc:
+ plain = plain.replace("> ","")
+ elif "_callout" in doc:
+ plain = plain.replace("_it","it").replace("well?_","well?").replace(">certain","certain")
+ elif "_conspiracy" in doc:
+ plain = plain.replace(">", "")
+ elif "_stroke" in doc:
+				plain = plain.replace("&", "&amp;")
+ elif "_bobby" in doc:
+				plain = plain.replace("&", "&amp;")
+ elif "_introvert" in doc:
+ plain = plain.replace("enjoy working out.","enjoy working out").replace("~~","")
+ elif "_social" in doc:
+				plain = plain.replace("the purpose","those purpose").replace("&#x200B;","")
+ no_space = re.sub(r"\s","",plain).replace("*","")
+ no_space = re.sub(r'\[([^]]+)\]\([^)]+\)',r'\1',no_space) # Remove Wiki style links: [text](URL)
+ if no_space_docs[doc] == "":
+ no_space_docs[doc] += re.sub(r"\s","",title).replace("*","")
+ no_space_docs[doc] += no_space
+
+ return no_space_docs
+
+
+def run_fetch():
+ if not os.path.isfile(script_dir + os.sep + "cache.txt"):
+ io.open(script_dir + os.sep + "cache.txt", "a").close() # Make sure cache file exists
+
+ cache = io.open(script_dir + os.sep + "cache.txt", encoding="utf8")
+ cache_dict = {}
+
+ for line in cache.read().split("\n"):
+ if "\t" in line:
+ post_id, text = line.split("\t")
+ cache_dict[post_id] = text if PY3 else text.decode("utf8")
+
+ if not os.path.isfile(script_dir + os.sep + "praw.txt"):
+		io.open(script_dir + os.sep + "praw.txt", "a").close()  # Make sure praw file exists
+
+ praw_cred = io.open(script_dir + os.sep + "praw.txt",encoding="utf8")
+ praw_dict = {}
+
+ for line in praw_cred.read().split("\n"):
+ if "\t" in line and not line.startswith("#"):
+ key, val = line.split("\t")
+ praw_dict[key] = val
+
+
+ # Check if cache is already complete
+ post_ids = []
+ for doc in docs:
+ for post in docs[doc]:
+ post_ids.append(post["id"])
+ if any(post not in cache_dict for post in post_ids):
+ incomplete = True
+ else:
+ incomplete = False
+
+ if incomplete:
+ # Check that user has valid json and praw credentials
+ if not all(key in praw_dict for key in ["client_id","client_secret","password","user_agent","username"]):
+ print("Missing praw credentials detected! You cannot download reddit data using praw.")
+ has_praw_cred = False
+ else:
+ has_praw_cred = True
+ try:
+ import praw
+ except ImportError as e:
+ print("Library praw not installed (pip install praw). You cannot download reddit data using praw.")
+ has_praw_cred = False
+
+ if not os.path.isfile(script_dir + os.sep + "key.json"):
+ print("Can't find Google BigQuery json key file. You cannot download reddit data using bigquery")
+ has_bigquery = False
+ else:
+ try:
+ has_bigquery = True
+ import bigquery
+ except ImportError as e:
+ print("Library bigquery not installed (pip install bigquery). You cannot download reddit data using bigquery.")
+ has_bigquery = False
+
+ if not has_praw_cred or not has_bigquery:
+ print("Missing access to bigquery and/or praw.")
+ print("Do you want to try downloading reddit data from an available server?")
+ print("Confirm: you are solely responsible for downloading reddit data and may only use it for non-commercial purposes:")
+ try:
+ # for python 2
+ response = raw_input("[Y]es/[N]o> ")
+ except NameError:
+ # for python 3
+ response = input("[Y]es/[N]o> ")
+
+ if response == "Y":
+ print("Retrieving reddit data by proxy...")
+ cache_dict = get_proxy_data()
+ out_docs = get_no_space_strings(cache_dict, overwrite_cache=True)
+ return out_docs, cache_dict
+ else:
+ print("Aborting")
+ sys.exit()
+ else:
+ print("Found praw and bigquery credentials.")
+ print("Would you like to use them to download reddit data?")
+ print("Confirm: you are solely responsible for downloading reddit data and may only use it for non-commercial purposes:")
+ try:
+ # for python 2
+ response = raw_input("[Y]es/[N]o> ")
+ except NameError:
+ # for python 3
+ response = input("[Y]es/[N]o> ")
+
+ if response == "Y":
+ print("Retrieving reddit data...")
+ out_docs = get_no_space_strings(cache_dict,praw_cred=praw_dict)
+ else:
+ print("Do you want to try downloading reddit data from an available server?")
+ print("Confirm: you are solely responsible for downloading reddit data and may only use it for non-commercial purposes:")
+ try:
+ # for python 2
+ response = raw_input("[Y]es/[N]o> ")
+ except NameError:
+ # for python 3
+ response = input("[Y]es/[N]o> ")
+
+ if response == "Y":
+ print("Retrieving reddit data by proxy...")
+ cache_dict = get_proxy_data()
+ out_docs = get_no_space_strings(cache_dict, overwrite_cache=True)
+ return out_docs, cache_dict
+ else:
+ print("Aborting")
+ sys.exit()
+ else:
+ print("Found complete reddit data in cache.txt ...")
+ print("Compiling raw strings")
+ out_docs = get_no_space_strings(cache_dict)
+
+ return out_docs
+
+
+if __name__ == "__main__":
+
+ p = ArgumentParser()
+ p.add_argument("-m","--mode",choices=["del","add"],default="add",help="Add or remove Reddit text data")
+ opts = p.parse_args()
+
+ if opts.mode == "del":
+ script_dir += os.sep
+ make_underscores(script_dir + "xml" + os.sep, 0, lemma_col=2)
+ make_underscores(script_dir + "coref" + os.sep + "gum" + os.sep + "tsv" + os.sep, 2)
+ make_underscores(script_dir + "coref" + os.sep + "ontogum" + os.sep + "tsv" + os.sep, 2)
+ make_underscores(script_dir + "coref" + os.sep + "gum" + os.sep + "conll" + os.sep, 1)
+ make_underscores(script_dir + "coref" + os.sep + "ontogum" + os.sep + "conll" + os.sep, 1)
+ make_underscores(script_dir + "coref" + os.sep + "ontogum" + os.sep + "conllu" + os.sep, 1)
+ make_underscores(script_dir + "dep" + os.sep, 1)
+ make_underscores_rst(script_dir + "rst" + os.sep + "rstweb" + os.sep)
+ make_underscores_rst(script_dir + "rst" + os.sep + "dependencies" + os.sep, extension="rsd", edu_regex=r"^([^\t\n]+\t)([^\t\n]+)(\t[^\n]+)")
+ make_underscores_rst(script_dir + "rst" + os.sep + "lisp_binary" + os.sep, extension="dis", edu_regex=r"^([^\n]+text _!)(.*?)(_![^\n]+)")
+ make_underscores_rst(script_dir + "rst" + os.sep + "lisp_nary" + os.sep, extension="dis", edu_regex=r"^([^\n]+text _!)(.*?)(_![^\n]+)")
+ underscore_disrpt(script_dir + "rst" + os.sep + "disrpt" + os.sep)
+ make_underscores_const(script_dir + "const" + os.sep)
+ else:
+ text_dict = run_fetch()
+ script_dir += os.sep
+
+ docs2lemmas, docs2tokens = make_text(script_dir + "xml" + os.sep, text_dict, 0, lemma_col=2)
+ make_text(script_dir + "coref" + os.sep + "gum" + os.sep + "tsv" + os.sep, text_dict, 2, unescape_xml=True)
+ make_text(script_dir + "coref" + os.sep + "ontogum" + os.sep + "tsv" + os.sep, text_dict, 2, unescape_xml=True)
+ make_text(script_dir + "coref" + os.sep + "gum" + os.sep + "conll" + os.sep, text_dict, 1, unescape_xml=True)
+ make_text(script_dir + "coref" + os.sep + "ontogum" + os.sep + "conll" + os.sep, text_dict, 1, unescape_xml=True)
+ make_text(script_dir + "coref" + os.sep + "ontogum" + os.sep + "conllu" + os.sep, text_dict, 1, unescape_xml=True, docs2lemmas=docs2lemmas, docs2tokens=docs2tokens)
+ make_text(script_dir + "dep" + os.sep, text_dict, 1, unescape_xml=True, docs2lemmas=docs2lemmas, docs2tokens=docs2tokens)
+ make_text_rst(script_dir + "rst" + os.sep + "rstweb" + os.sep, text_dict)
+ make_text_rst(script_dir + "rst" + os.sep + "dependencies" + os.sep, text_dict, unescape_xml=True, extension="rsd", edu_regex=r"^([^\t\n]+\t)([^\t\n]+)(\t[^\n]+)")
+ make_text_rst(script_dir + "rst" + os.sep + "lisp_binary" + os.sep, text_dict, unescape_xml=True, extension="dis", edu_regex=r"^([^\n]+text _!)(.*?)(_![^\n]+)")
+ make_text_rst(script_dir + "rst" + os.sep + "lisp_nary" + os.sep, text_dict, unescape_xml=True, extension="dis", edu_regex=r"^([^\n]+text _!)(.*?)(_![^\n]+)")
+ make_text_const(script_dir + "const" + os.sep, text_dict)
+ restore_disrpt(script_dir + "rst" + os.sep + "disrpt" + os.sep, text_dict)
+
+ if False:
+
+ files = glob(script_dir + os.sep + "dep" + os.sep + "*_reddit_*.conllu")
+
+ for file_ in files:
+ next_doc = False
+ doc = os.path.basename(file_).replace(".conllu","")
+ if doc not in docs2chars:
+ sys.stderr.write("ERR: Could not find text data for document " + doc + "! Skipping...\n")
+ continue
+
+ text = docs2chars[doc]
+ with io.open(file_,encoding="utf8") as f:
+ lines = f.read().split("\n")
+ output = []
+ sents = []
+ sent = ""
+ word_len = 0
+ skip = 0
+ no_space_next = False
+ skip_space = False
+ for line in lines:
+ if "\t" in line:
+ fields = line.split("\t")
+ if "-" not in fields[0] and "." not in fields[0]: # Token
+ # Process MISC field
+ misc_annos = fields[-1].split("|")
+ out_misc = []
+ for anno in misc_annos:
+ if anno.startswith("Len="):
+ word_len = int(anno.split('=')[1])
+ elif anno.startswith("Lem="):
+ lemma_rule = anno.split('=')[1]
+ else:
+ out_misc.append(anno)
+ if word_len == 0: # There was no Len annotation, documents are already restored?
+ sys.stderr.write("ERR: Missing word length information in doc " + doc + ". Has text been restored already? Skipping...\n")
+ next_doc = True
+ break
+ if len(out_misc) == 0:
+ out_misc = ["_"]
+ fields[-1] = "|".join(sorted(out_misc))
+
+ # Reconstruct word and lemma
+ word = text[:word_len]
+ text = text[word_len:]
+ fields[1] = word
+ if lemma_rule == "*LOWER*":
+ fields[2] = word.lower()
+ elif lemma_rule == "_":
+ fields[2] = word
+ else:
+ fields[2] = lemma_rule
+ if fields[7] == "goeswith":
+ fields[2] = "_" # No lemma for goeswith token
+ if fields[0] == "1" and sent != "": # New sentence
+ sents.append(sent.strip())
+ sent = ""
+ sent += word
+ if skip > 0:
+ skip -= 1
+ if skip == 0 and no_space_next:
+ skip_space = True
+ else:
+ skip_space = False
+ if "SpaceAfter=No" not in fields[-1] and not skip_space and skip == 0:
+ sent += " "
+ no_space_next = False
+ skip_space = False
+ line = "\t".join(fields)
+ docs2tokens[doc].append(fields[1])
+ docs2lemmas[doc].append(fields[2])
+ elif "-" in fields[0]:
+ misc_annos = fields[-1].split("|")
+ out_misc = []
+ for anno in misc_annos:
+ if anno.startswith("Len="):
+ word_len = int(anno.split('=')[1])
+ else:
+ out_misc.append(anno)
+ fields[1] = text[:word_len]
+ fields[-1] = "|".join(out_misc) if len(out_misc) > 0 else "_"
+ line = "\t".join(fields)
+ skip = 2
+ if "SpaceAfter=No" in fields[-1]:
+ no_space_next = True
+
+ output.append(line)
+
+ if next_doc:
+ continue
+
+ sents.append(sent.strip())
+ no_sent_text = "\n".join(output)
+
+ out_sents = []
+ raw_sents = no_sent_text.split("\n\n")
+ for i,sent in enumerate(raw_sents):
+ if i