* Reddit data can now be reconstructed in top folders without rebuilding the corpus
Commit 40af9bd (1 parent: cd17cea)
Showing 7 changed files with 985 additions and 48 deletions.
@@ -1,25 +1,24 @@
-# Data from reddit
+# Data from Reddit

-For one of the text types in this corpus, reddit forum discussions, plain text data is not supplied in this repository. To obtain this data, please follow the instructions below.
+For one of the text types in this corpus, Reddit forum discussions, plain text data is not supplied in this repository. To obtain this data, please follow the instructions below.

 ## Annotations

-Documents in the reddit subcorpus are named GUM_reddit_* (e.g. GUM_reddit_superman) and are *not* included in the root folder with all annotation layers. The annotations for the reddit subcorpus can be found together with all other document annotations in `_build/src/`. Token representations in these files are replaced with underscores, while the annotations themselves are included in the files. To compile the corpus including reddit data, you must obtain the underlying texts.
+Documents in the Reddit subcorpus are named `GUM_reddit_*` (e.g. GUM_reddit_superman) and are included in the root folder with all annotation layers but with underscores instead of text. To compile the corpus including Reddit data, you must obtain the underlying texts, and either regenerate the files in the top level folders (works for all formats except `PAULA` and `annis`), or rebuild the corpus (see below).

-## Obtaining underlying reddit text data
+## Obtaining underlying Reddit text data

-To recover reddit data, use the API provided by the script `_build/process_reddit.py`. If you have your own credentials for use with the Python reddit API wrapper (praw) and Google bigquery, you should include them in two files, `praw.txt` and `key.json` in `_build/utils/get_reddit/`. For this to work, you must have the praw and bigquery libraries installed for python (e.g. via pip). You can then run `python _build/process_reddit.py` to recover the data, and proceed to the next step, re-building the corpus.
+To recover Reddit data, use the API provided by the Python script `get_text.py`, which will restore text in all top-level folders except for `PAULA` and `annis`. If you do not have credentials for the Python Reddit API wrapper (praw) and Google bigquery, the script can attempt to download data for you from a proxy. Otherwise you can also use your own credentials for praw etc. and include them in two files, `praw.txt` and `key.json`. For this to work, you must have the praw and bigquery libraries installed for python (e.g. via pip).

-Alternatively, if you can't use praw/bigquery, the script `_build/process_reddit.py` will offer to download the data for you by proxy. To do this, run the script and confirm that you will only use the data according to the terms and conditions determined by reddit, and for non-commercial purposes. The script will then download the data for you - if the download is successful, you can continue to the next step and re-build the corpus.
+If you also require the `PAULA` and `annis` formats, you must rebuild the corpus from `_build/src/`. To do this, run `_build/process_reddit.py`, which again requires either running a proxy download or using your own credentials and placing them in `_build/utils/get_reddit/`. Once the download completes successfully, you will need to rebuild the corpus as explained in the next step.

-## Rebuilding the corpus with reddit data
+## Rebuilding the corpus with Reddit data

-To compile all projected annotations and produce all formats not included in `_build/src/`, you will need to run the GUM build bot: `python _build/build_gum.py`. This process is described in detail at https://gucorpling.org/gum/build.html, but summarized instructions follow.
+To compile all projected annotations and produce all formats not included in `_build/src/`, you will need to run the GUM build bot: `python build_gum.py` in `_build/`. This process is described in detail at https://gucorpling.org/gum/build.html, but summarized instructions follow.

-At a minumum, you can run `python _build/build_gum.py` with no options. This will produce basic formats in `_build/target/`, but skip generating fresh constituent parses, CLAWS5 tags and the Universal Dependencies version of the dependency data. To include these you will need:
+At a minumum, you can run `build_gum.py` with no options. This will produce basic formats in `_build/target/`, but skip generating fresh constituent parses and CLAWS5 tags. To include these you will need:

 * CLAWS5: use option -c and ensure that utils/paths.py points to an executable for the TreeTagger (http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/). The CLAWS 5 parameter file is already included in utils/treetagger/lib/, and tags are auto-corrected by the build bot based on gold PTB tags.
 * Constituent parses: option -p; ensure that paths.py correctly points your installation of the Stanford Parser/CoreNLP
-* Universal Dependencies: option -u; ensure the paths.py points to CoreNLP, and that you have installed udapi and depedit (pip install udapi; pip install depedit). Note that this only works with Python 3.

-If you run into problems building the corpus, feel free to report an issue via GitHub or contact us via e-mail.
+After the build bot runs, data including `PAULA` and `annis` versions will be generated in the specified `target/` folder. If you run into problems building the corpus, feel free to report an issue via GitHub or contact us via e-mail.
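The recovery workflow in the updated README hinges on whether praw/bigquery credentials are present. The following sketch is not part of the repository; only the file and folder names (`praw.txt`, `key.json`, `_build/utils/get_reddit/`) come from the README above, and the check itself is purely illustrative of which mode the recovery scripts would run in:

```python
import os

# File and folder names are those given in the README above; this check is
# only an illustrative sketch, not part of the repository.
cred_dir = os.path.join("_build", "utils", "get_reddit")
has_credentials = all(os.path.isfile(os.path.join(cred_dir, name))
                      for name in ("praw.txt", "key.json"))

if has_credentials:
    print("praw/bigquery credentials found: the recovery scripts can query the APIs directly.")
else:
    print("No credentials found: the scripts will offer a proxy download instead.")
```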
@@ -0,0 +1,264 @@
"""
process_underscores.py
Script to handle licensed data for which underlying text cannot be posted online (e.g. LDC data).
Users need a copy of the LDC distribution of an underlying resource to restore text in some of the corpora.
"""

__author__ = "Amir Zeldes"
__license__ = "Apache 2.0"
__version__ = "2.0.0"

import io, re, os, sys
from glob import glob
from collections import defaultdict
script_dir = os.path.dirname(os.path.realpath(__file__)) + os.sep

PY3 = sys.version_info[0] == 3

def underscore_files(filenames):
    def underscore_rel_field(text):
        blanked = []
        text = text.replace("<*>","❤")
        for c in text:
            if c!="❤" and c!=" ":
                blanked.append("_")
            else:
                blanked.append(c)
        return "".join(blanked).replace("❤","<*>")

    if isinstance(filenames,str):
        filenames = glob(filenames + "*.*")
    for f_path in filenames:
        skiplen = 0
        with io.open(f_path, 'r', encoding='utf8') as fin:
            lines = fin.readlines()

        with io.open(f_path, 'w', encoding='utf8', newline="\n") as fout:
            output = []
            if f_path.endswith(".rels"):
                for l, line in enumerate(lines):
                    line = line.strip()
                    if "\t" in line and l > 0:
                        doc, unit1_toks, unit2_toks, unit1_txt, unit2_txt, s1_toks, s2_toks, unit1_sent, unit2_sent, direction, orig_label, label = line.split("\t")
                        if "GUM" in doc and "reddit" not in doc:
                            output.append(line)
                            continue
                        unit1_txt = underscore_rel_field(unit1_txt)
                        unit2_txt = underscore_rel_field(unit2_txt)
                        unit1_sent = underscore_rel_field(unit1_sent)
                        unit2_sent = underscore_rel_field(unit2_sent)
                        fields = doc, unit1_toks, unit2_toks, unit1_txt, unit2_txt, s1_toks, s2_toks, unit1_sent, unit2_sent, direction, orig_label, label
                        line = "\t".join(fields)
                    output.append(line)
            else:
                doc = ""
                for line in lines:
                    line = line.strip()
                    if line.startswith("# newdoc id"):
                        doc = line.split("=",maxsplit=1)[1].strip()
                    if "GUM" in doc and "reddit" not in doc:
                        output.append(line)
                        continue
                    if line.startswith("# text"):
                        m = re.match(r'(# text ?= ?)(.+)',line)
                        if m is not None:
                            line = m.group(1) + re.sub(r'[^\s]','_',m.group(2))
                        output.append(line)
                    elif "\t" in line:
                        fields = line.split("\t")
                        tok_col, lemma_col = fields[1:3]
                        if lemma_col == tok_col:  # Delete lemma if identical to token
                            fields[2] = '_'
                        elif tok_col.lower() == lemma_col:
                            fields[2] = "*LOWER*"
                        if skiplen < 1:
                            fields[1] = len(tok_col)*'_'
                        else:
                            skiplen -= 1
                        output.append("\t".join(fields))
                        if "-" in fields[0]:  # Multitoken
                            start, end = fields[0].split("-")
                            start = int(start)
                            end = int(end)
                            skiplen = end - start + 1
                    else:
                        output.append(line)
            fout.write('\n'.join(output) + "\n")


def restore_docs(path_to_underscores,text_dict):
    def restore_range(range_string, underscored, tid_dict):
        output = []
        tok_ids = []
        range_strings = range_string.split(",")
        for r in range_strings:
            if "-" in r:
                s, e = r.split("-")
                tok_ids += list(range(int(s),int(e)+1))
            else:
                tok_ids.append(int(r))

        for tok in underscored.split():
            if tok == "<*>":
                output.append(tok)
            else:
                tid = tok_ids.pop(0)
                output.append(tid_dict[tid])
        return " ".join(output)

    dep_files = glob(path_to_underscores+os.sep+"*.conllu")
    tok_files = glob(path_to_underscores+os.sep+"*.tok")
    rel_files = glob(path_to_underscores+os.sep+"*.rels")
    skiplen = 0
    token_dict = {}
    tid2string = defaultdict(dict)
    for file_ in dep_files + tok_files + rel_files:
        lines = io.open(file_,encoding="utf8").readlines()
        underscore_len = 0  # Must match doc_len at end of file processing
        doc_len = 0
        if file_.endswith(".rels"):
            output = []
            violation_rows = []
            for l, line in enumerate(lines):
                line = line.strip()
                if l > 0 and "\t" in line:
                    fields = line.split("\t")
                    docname = fields[0]
                    text = text_dict[docname]
                    if "GUM_" in docname and "reddit" not in docname:  # Only Reddit documents need reconstruction in GUM
                        output.append(line)
                        continue
                    doc, unit1_toks, unit2_toks, unit1_txt, unit2_txt, s1_toks, s2_toks, unit1_sent, unit2_sent, direction, orig_label, label = line.split("\t")
                    underscore_len += unit1_txt.count("_") + unit2_txt.count("_") + unit1_sent.count("_") + unit2_sent.count("_")
                    if underscore_len == 0:
                        #sys.stderr.write("! Non-underscored file detected - " + os.path.basename(file_) + "\n")
                        print("! DISRPT format alreadt restored in " + os.path.basename(file_) + "\n")
                        sys.exit(0)
                    unit1_txt = restore_range(unit1_toks, unit1_txt, tid2string[docname])
                    unit2_txt = restore_range(unit2_toks, unit2_txt, tid2string[docname])
                    unit1_sent = restore_range(s1_toks, unit1_sent, tid2string[docname])
                    unit2_sent = restore_range(s2_toks, unit2_sent, tid2string[docname])
                    plain = unit1_txt + unit2_txt + unit1_sent + unit2_sent
                    plain = plain.replace("<*>","").replace(" ","")
                    doc_len += len(plain)
                    fields = doc, unit1_toks, unit2_toks, unit1_txt, unit2_txt, s1_toks, s2_toks, unit1_sent, unit2_sent, direction, orig_label, label
                    line = "\t".join(fields)
                    if doc_len != underscore_len and len(violation_rows) == 0:
                        violation_rows.append(str(l) + ": " + line)
                output.append(line)

        else:
            tokfile = True if ".tok" in file_ else False
            output = []
            parse_text = ""
            docname = ""
            for line in lines:
                line = line.strip()
                if "# newdoc id " in line:
                    tid = 0
                    if parse_text != "":
                        if not tokfile:
                            token_dict[docname] = parse_text
                        parse_text = ""
                    docname = re.search(r'# newdoc id ?= ?([^\s]+)',line).group(1)
                    if "GUM" in docname and "reddit" not in docname:
                        output.append(line)
                        continue
                    if docname not in text_dict:
                        raise IOError("! Text for document name " + docname + " not found.\n Please check that your LDC data contains the file for this document.\n")
                    if ".tok" in file_:
                        if docname not in token_dict:  # Fetch continuous token string from conllu
                            parse_conllu = open(os.sep.join([script_dir,"..","..","..","dep",docname + ".conllu"])).read()
                            toks = [l.split("\t") for l in parse_conllu.split("\n") if "\t" in l]
                            toks = [l[1] for l in toks if "-" not in l[0] and "." not in l[0]]
                            toks = "".join(toks)
                            token_dict[docname] = toks
                        text = token_dict[docname]
                    else:
                        text = text_dict[docname]
                    doc_len = len(text)
                    underscore_len = 0

                if "GUM" in docname and "reddit" not in docname:
                    output.append(line)
                    continue

                if line.startswith("# text"):
                    m = re.match(r'(# ?text ?= ?)(.+)',line)
                    if m is not None:
                        i = 0
                        sent_text = ""
                        for char in m.group(2).strip():
                            if char != " ":
                                try:
                                    sent_text += text[i]
                                except:
                                    raise IOError("Can't fix")
                                i += 1
                            else:
                                sent_text += " "
                        line = m.group(1) + sent_text
                    output.append(line)
                elif "\t" in line:
                    fields = line.split("\t")
                    if skiplen < 1:
                        underscore_len += len(fields[1])
                        fields[1] = text[:len(fields[1])]
                    if not "-" in fields[0] and not "." in fields[0]:
                        parse_text += fields[1]
                        tid += 1
                        tid2string[docname][tid] = fields[1]
                    if not tokfile:
                        if fields[2] == '_' and not "-" in fields[0] and not "." in fields[0]:
                            fields[2] = fields[1]
                        elif fields[2] == "*LOWER*":
                            fields[2] = fields[1].lower()
                    if skiplen < 1:
                        text = text[len(fields[1]):]
                    else:
                        skiplen -= 1
                    output.append("\t".join(fields))
                    if "-" in fields[0]:  # Multitoken
                        start, end = fields[0].split("-")
                        start = int(start)
                        end = int(end)
                        skiplen = end - start + 1
                else:
                    output.append(line)

        if not doc_len == underscore_len:
            if ".rels" in file_:
                sys.stderr.write(
                    "\n! Tried to restore file " + os.path.basename(file_) + " but source text has different length than tokens in shared task file:\n" +
                    "  Source text in data/: " + str(doc_len) + " non-whitespace characters\n" +
                    "  Token underscores in " + file_ + ": " + str(underscore_len) + " non-whitespace characters\n" +
                    "  Violation row: " + violation_rows[0])
            else:
                sys.stderr.write("\n! Tried to restore document " + docname + " but source text has different length than tokens in shared task file:\n" +
                                 "  Source text in data/: " + str(doc_len) + " non-whitespace characters\n" +
                                 "  Token underscores in " + file_ + ": " + str(underscore_len) + " non-whitespace characters\n")
                with io.open("debug.txt",'w',encoding="utf8") as f:
                    f.write(text_dict[docname])
                    f.write("\n\n\n")
                    f.write(parse_text)
            sys.exit(0)

        if not tokfile and parse_text != "":
            token_dict[docname] = parse_text

        with io.open(file_, 'w', encoding='utf8', newline="\n") as fout:
            fout.write("\n".join(output) + "\n")

    print("o Restored text for DISRPT format in " +
          #str(len(dep_files)) + " .conllu files, " +
          str(len(tok_files)) + " .tok files and " + str(len(rel_files)) + " .rels files\n")
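To make the effect of `underscore_files` above concrete, here is a small self-contained sketch (the input lines are hypothetical, not taken from the corpus) that applies the same masking rules to one `# text` comment and one token row:

```python
import re

# Hypothetical CoNLL-U fragment (not taken from the corpus)
text_line = "# text = Superman is a fictional character"
token_line = "1\tSuperman\tsuperman\tPROPN"   # ID, form, lemma, POS

# Comment lines: every non-space character of the sentence text is masked
prefix, sent = text_line.split("= ", 1)
print(prefix + "= " + re.sub(r'[^\s]', '_', sent))
# -> "# text = ________ __ _ _________ _________"

# Token rows: the form is replaced by underscores of equal length; a lemma
# identical to the form becomes "_", and one that is just the lowercased form
# is recorded as "*LOWER*" so it can be recovered later
fields = token_line.split("\t")
if fields[2] == fields[1]:
    fields[2] = "_"
elif fields[2] == fields[1].lower():
    fields[2] = "*LOWER*"
fields[1] = "_" * len(fields[1])
print("\t".join(fields))
# -> "1\t________\t*LOWER*\tPROPN"
```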
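And a minimal usage sketch for the two entry points, assuming the script is importable as `process_underscores` and that you have a `text_dict` mapping document names to their recovered text as one continuous, whitespace-free string of token characters (the folder name here is a placeholder):

```python
from process_underscores import underscore_files, restore_docs

# Hypothetical inputs: a folder of underscored .conllu/.tok/.rels files and a
# dict from document names to whitespace-free token strings.
text_dict = {"GUM_reddit_superman": "Supermanisafictionalsuperhero..."}  # placeholder

underscore_files("underscored/")        # mask forms, lemmas and text comments in place
restore_docs("underscored", text_dict)  # write the recovered text back into the files
```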
@@ -0,0 +1,3 @@
# GUM corpus - ANNIS version (without Reddit)

The zip file in this directory can be imported for search and visualization into [ANNIS](https://corpus-tools.org/annis/). Note that this version of the corpus does not include the Reddit subcorpus of GUM. To compile an ANNIS version of the corpus including the Reddit subcorpus, please see [README_reddit.md](https://github.com/amir-zeldes/gum/blob/master/README_reddit.md).
@@ -0,0 +1,3 @@
# GUM corpus - PAULA XML version (without Reddit)

The zip file in this directory contains a complete version of all annotations in [PAULA standoff XML](https://github.com/korpling/paula-xml). However note that this version of the corpus does not include the Reddit subcorpus of GUM. To compile a PAULA version of the entire corpus including the Reddit subcorpus, please see [README_reddit.md](https://github.com/amir-zeldes/gum/blob/master/README_reddit.md).