Merge pull request EducationalTestingService#35 from mheilman/develop
fixed segmenter to work at sentence level, added visualization scripts, added correct license info
mheilman committed Nov 7, 2014
2 parents 9d342b8 + 362c922 commit 2dfd368
Showing 32 changed files with 633 additions and 297 deletions.
5 changes: 3 additions & 2 deletions LICENSE → LICENSE.txt
@@ -1,6 +1,7 @@
The MIT License (MIT)

Copyright (c) 2014 Educational Testing Service
Copyright (c) 2014 Educational Testing Service and University of Southern
California

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
@@ -18,4 +19,4 @@ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
SOFTWARE.
13 changes: 12 additions & 1 deletion README.md
@@ -1,12 +1,18 @@

License
=======

This code is licensed under the MIT license (see LICENSE.txt).


Setup
=====

This code requires python 3. I currently use 3.3.5.

This repository is pip-installable. To make it work properly, I recommend running `pip install -e .` to set it up. This will make a local, editable copy in your python environment. See `requirements.txt` for a list of the prerequisite packages. In addition, you may have to install a few NLTK models using `nltk.download()` in python (specifically, punkt and, at least for now, the maxent POS tagger).
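A minimal sketch of that last step in Python, assuming current NLTK resource identifiers (`punkt` and a maxent-based treebank POS tagger); the exact names are assumptions and may differ across NLTK versions:

```python
# Hedged sketch: the resource identifiers below are assumptions,
# not taken from this repository.
import nltk

nltk.download('punkt')                       # sentence tokenizer
nltk.download('maxent_treebank_pos_tagger')  # maxent POS tagger (name varies by NLTK version)
```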

Additionally, the syntactic parsing code must be set up to use ZPar. The simplest but least efficient way is to put the ZPar distribution (version 0.6) in a subdirectory `zpar` (or symbolic link) in the current working directory, along with the English models in a subdirectory `zpar/english`. For efficiency, a better method is to use the `python-zpar` wrapper, which is currently available at `https://bitbucket.org/desilinguist/python-zpar`. To set this up, run make and then either a) set an environment variable `ZPAR_LIBRARY_DIR` equal to the directory where `zpar.so` is created (e.g., `/Users/USER1/python-zpar/dist`) to run ZPar as part of the discourse parser, or b) start a separate server using python-zpar's `zpar_server.py`.
Additionally, the syntactic parsing code must be set up to use ZPar. The simplest but least efficient way is to put the ZPar distribution (version 0.6) in a subdirectory `zpar` (or symbolic link) in the current working directory, along with the English models in a subdirectory `zpar/english`. For efficiency, a better method is to use the `python-zpar` wrapper, which is currently available at `https://github.com/EducationalTestingService/python-zpar` or `https://pypi.python.org/pypi/python-zpar/`. To set this up, run make and then either a) set an environment variable `ZPAR_LIBRARY_DIR` equal to the directory where `zpar.so` is created (e.g., `/Users/USER1/python-zpar/dist`) to run ZPar as part of the discourse parser, or b) start a separate server using python-zpar's `zpar_server`.
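For option a), a minimal sketch of setting the environment variable from Python (the path is a placeholder; exporting the variable in your shell before starting Python is the safer route, since python-zpar may read it at import time):

```python
# Hypothetical placeholder path; point this at the directory containing zpar.so.
import os

os.environ['ZPAR_LIBRARY_DIR'] = '/Users/USER1/python-zpar/dist'
```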

Finally, CRF++ (version 0.58) should be installed, and its `bin` directory should be added to your `PATH` environment variable. See `http://crfpp.googlecode.com/svn/trunk/doc/index.html`.

@@ -69,3 +75,8 @@ rst_eval rst_discourse_tb_edus_TRAINING_DEV.json -p rst_parsing_modelC1.0 --use_
This will compute precision, recall, and F1 scores for 3 scenarios: spans labeled with nuclearity and relation types, spans labeled only with nuclearity, and unlabeled token spans. The above version of the command will use gold standard EDUs and syntactic parses.
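For reference, a hedged sketch of the span-level precision/recall/F1 computation described above, assuming predicted and gold spans are compared as sets of (start, end, label) tuples; this is illustrative and not the project's `rst_eval` implementation:

```python
def span_prf1(predicted, gold):
    '''Precision, recall, and F1 over two collections of labeled spans.'''
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```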

NOTE: The evaluation script has basic functionality in place, but at the moment it almost certainly does not appropriately handle important edge cases (e.g., same-unit relations, relations at the top of the tree). These issues need to be addressed before the script can be used in experiments.

Visualization
=============

The script `util/visualize_rst_tree.py` can be used to create an HTML/javascript visualization, using D3.js (http://d3js.org/). See the D3.js license: `util/LICENSE_d3.txt`. The input to the script is the output of `rst_parse`. See `util/example.json` for an example input.
1 change: 1 addition & 0 deletions discourseparsing/collapse_rst_labels.py
@@ -1,4 +1,5 @@
#!/usr/bin/env python3
# License: MIT

'''
Script to collapse RST discourse treebank relation types,
1 change: 1 addition & 0 deletions discourseparsing/convert_rst_discourse_tb.py
@@ -1,4 +1,5 @@
#!/usr/bin/env python3
# License: MIT

'''
This script merges the RST Discourse Treebank
34 changes: 3 additions & 31 deletions discourseparsing/discourse_parsing.py
@@ -1,36 +1,8 @@
# License: MIT

'''
License
-------
Copyright (c) 2014, Educational Testing Service and Kenji Sagae
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
1. Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR
ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
Description
-----------
This is a python version of a shift-reduce RST discourse parser,
originally written by Kenji Sagae in perl.
This is a python shift-reduce RST discourse parser based partly on a perl
parser written by Kenji Sagae.
'''

import os
85 changes: 39 additions & 46 deletions discourseparsing/discourse_segmentation.py
@@ -1,3 +1,4 @@
# License: MIT

from tempfile import NamedTemporaryFile
import shlex
@@ -26,12 +27,21 @@ def parse_node_features(nodes):

def extract_segmentation_features(doc_dict):
'''
This extracts features for use in the discourse segmentation CRF. Note that
the CRF++ template makes it so that the features for the current word and
2 previous and 2 next words are used for each word.
:param doc_dict: A dictionary of edu_start_indices, tokens, syntax_trees,
token_tree_positions, and pos_tags for a document, as
extracted by convert_rst_discourse_tb.py.
token_tree_positions, and pos_tags for a document, as
extracted by convert_rst_discourse_tb.py.
:returns: a list of lists of lists of features (one feature list per word
per sentence), and a list of lists of labels (one label per word
per sentence)
'''
labels = []
feat_lists = []

labels_doc = []
feat_lists_doc = []

if 'edu_start_indices' in doc_dict:
edu_starts = {(x[0], x[1]) for x in doc_dict['edu_start_indices']}
else:
@@ -43,14 +53,16 @@ def extract_segmentation_features(doc_dict):
doc_dict['syntax_trees'],
doc_dict['token_tree_positions'],
doc_dict['pos_tags'])):

labels_sent = []
feat_lists_sent = []

tree = HeadedParentedTree.fromstring(tree_str)
for token_num, (token, tree_position, pos_tag) \
in enumerate(zip(sent_tokens, sent_tree_positions, pos_tags)):
feats = []
label = 'B-EDU' if (sent_num, token_num) in edu_starts else 'C-EDU'

# TODO: all of the stuff below needs to be checked

# POS tags and words for lexicalized parse nodes
# from 3.2 of Bach et al., 2012.
# preterminal node for the current word
@@ -76,17 +88,18 @@
# now make the list of features
feats.append(token.lower())
feats.append(pos_tag)
feats.append('B-SENT' if token_num == 0 else 'C-SENT')
feats.extend(parse_node_features([node_p,
ancestor_w,
ancestor_r,
node_p_parent,
node_p_right_sibling]))

feat_lists.append(feats)
labels.append(label)
feat_lists_sent.append(feats)
labels_sent.append(label)
feat_lists_doc.append(feat_lists_sent)
labels_doc.append(labels_sent)

return feat_lists, labels
return feat_lists_doc, labels_doc
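To make the nested return structure concrete, a made-up illustration (values invented, feature lists truncated) of the per-sentence, per-word lists that the new code builds:

```python
# Illustrative values only; real feature lists contain more entries
# (lexicalized parse-node features, etc.).
feat_lists_doc = [                      # one entry per sentence
    [['the', 'DT', 'B-SENT'],           # one feature list per word
     ['cat', 'NN', 'C-SENT'],
     ['sat', 'VBD', 'C-SENT']],
]
labels_doc = [                          # parallel per-sentence label lists
    ['B-EDU', 'C-EDU', 'C-EDU'],
]
```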


class Segmenter():
@@ -98,27 +111,24 @@ def segment_document(self, doc_dict):
logging.info('segmenting document, doc_id = {}'.format(doc_id))

# Extract features.
# TODO interact with crf++ via cython, etc.?
tmpfile = NamedTemporaryFile('w')
feat_lists, _ = extract_segmentation_features(doc_dict)
for feat_list in feat_lists:
print('\t'.join(feat_list + ["?"]), file=tmpfile)
feat_lists_doc, _ = extract_segmentation_features(doc_dict)
for feat_lists_sent in feat_lists_doc:
for feat_list_word in feat_lists_sent:
print('\t'.join(feat_list_word + ["?"]), file=tmpfile)
print('\n', file=tmpfile)
tmpfile.flush()

# Get predictions from the CRF++ model.
# TODO interact with crf++ via cython, etc.?
crf_output = subprocess.check_output(shlex.split(
'crf_test -m {} {}'.format(self.model_path, tmpfile.name))) \
.decode('utf-8').strip()
tmpfile.close()

# an index into the list of tokens for this document indicating where
# the current sentence started
sent_start_index = 0

# an index into the list of sentences
sent_num = 0

edu_number = 0
edu_num = 0

# Check that the input is not blank.
all_tokens = doc_dict['tokens']
@@ -128,33 +138,16 @@

# Construct the set of EDU start index tuples (sentence number, token
# number, EDU number).
cur_sent = all_tokens[0]
edu_start_indices = []
for tok_index, line in enumerate(crf_output.split('\n')):
if tok_index - sent_start_index >= len(cur_sent):
sent_start_index += len(cur_sent)
sent_num += 1
cur_sent = all_tokens[sent_num] if sent_num < len(
all_tokens) else None
# Start a new EDU where the CRF predicts "B-EDU".
# Also, force new EDUs to start at the beginnings of sentences to
# account for the rare cases where the CRF does not predict "B-EDU"
# at the beginning of a new sentence (CRF++ can only learn this as
# a soft constraint).
start_of_sentence = (tok_index - sent_start_index == 0)
token_label = line.split()[-1]
if token_label == "B-EDU" or start_of_sentence:
if start_of_sentence and token_label != "B-EDU":
logging.info(("The CRF segmentation model did not" +
" predict B-EDU at the start of a" +
" sentence. A new EDU will be started" +
" regardless, to ensure consistency with" +
" the RST annotations. doc_id = {}")
.format(doc_id))

edu_start_indices.append(
(sent_num, tok_index - sent_start_index, edu_number))
edu_number += 1

for sent_num, crf_output_sent in enumerate(crf_output.split('\n\n')):
for tok_num, line in enumerate(crf_output_sent.split('\n')):
# Start a new EDU where the CRF predicts "B-EDU" and
# at the beginnings of sentences.
token_label = line.split()[-1]
if token_label == "B-EDU" or tok_num == 0:
edu_start_indices.append((sent_num, tok_num, edu_num))
edu_num += 1

# Check that all sentences are covered by the output list of EDUs,
# and that every new sentence starts an EDU.
1 change: 1 addition & 0 deletions discourseparsing/extract_actions_from_trees.py
@@ -1,4 +1,5 @@
#!/usr/bin/env python3
# License: MIT

'''
A script for converting the RST discourse treebank into a gold standard
Expand Down
13 changes: 7 additions & 6 deletions discourseparsing/extract_segmentation_features.py
@@ -1,4 +1,5 @@
#!/usr/bin/env python3
# License: MIT

'''
A discourse segmenter following the Base model from this paper:
@@ -30,12 +31,12 @@ def main():

with open(args.output_path, 'w') as outfile:
for doc in data:
feat_lists, labels = extract_segmentation_features(doc)
for feat_list, label in zip(feat_lists, labels):
print('\t'.join(feat_list + [label]), file=outfile)

print('\t'.join(['' for x in range(len(feat_lists[0]) + 1)]),
file=outfile)
feat_lists_doc, labels_doc = extract_segmentation_features(doc)
for feat_lists_sent, labels_sent in zip(feat_lists_doc, labels_doc):
for feat_list, label in zip(feat_lists_sent, labels_sent):
print('\t'.join(feat_list + [label]), file=outfile)
# blank lines between sentences (and documents)
print(file=outfile)


if __name__ == '__main__':
1 change: 1 addition & 0 deletions discourseparsing/io_util.py
@@ -1,3 +1,4 @@
# License: MIT

import cchardet
import logging
7 changes: 5 additions & 2 deletions discourseparsing/make_segmentation_crfpp_template.py
@@ -1,9 +1,10 @@
#!/usr/bin/env python3
# License: MIT

'''
This generates a feature template file for CRF++.
See http://crfpp.googlecode.com/svn/trunk/doc/index.html.
It will need to be rerun if new features are added to the segmenter.
It needs to be rerun when the segmenter feature set changes.
'''

import argparse
@@ -12,6 +13,8 @@
def make_segmentation_crfpp_template(output_path, num_features=13):
with open(output_path, 'w') as outfile:
for i in range(num_features):
# This makes it so the features for the current word are based
# on the current word and the previous 2 and next 2 words.
for j in [-2, -1, 0, 1, 2]:
print('U{:03d}{}:%x[{},{}]'.format(i, j + 2, j, i),
file=outfile)
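As a concrete illustration of what this loop emits (derived from the format string above, not copied from a generated template file), the lines for feature column 0 would be:

```python
# Reproduces the format call above for feature column 0 (illustrative).
for j in [-2, -1, 0, 1, 2]:
    print('U{:03d}{}:%x[{},{}]'.format(0, j + 2, j, 0))
# U0000:%x[-2,0]
# U0001:%x[-1,0]
# U0002:%x[0,0]
# U0003:%x[1,0]
# U0004:%x[2,0]
```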
@@ -25,7 +28,7 @@ def main():
'--output_path',
help='A path to where the CRF++ template file should be created.',
default='segmentation_crfpp_template.txt')
parser.add_argument('--num_features', type=int, default=13)
parser.add_argument('--num_features', type=int, default=12)
args = parser.parse_args()
make_segmentation_crfpp_template(args.output_path, args.num_features)

1 change: 1 addition & 0 deletions discourseparsing/make_traindev_split.py
@@ -1,4 +1,5 @@
#!/usr/bin/env python3
# License: MIT

'''
A script to split up the official RST discourse treebank training set into a
1 change: 1 addition & 0 deletions discourseparsing/paragraph_splitting.py
@@ -1,4 +1,5 @@
#!/usr/bin/env python3
# License: MIT

import logging
import re
1 change: 1 addition & 0 deletions discourseparsing/parse_util.py
@@ -1,3 +1,4 @@
# License: MIT

import ctypes as c
import socket