Merge pull request EducationalTestingService#35 from mheilman/develop
fixed segmenter to work at sentence level, added visualization scripts, added correct license info
mheilman committed Nov 7, 2014
2 parents 9d342b8 + 362c922 commit 2dfd368
Showing 32 changed files with 633 additions and 297 deletions.
5 changes: 3 additions & 2 deletions LICENSE → LICENSE.txt
@@ -1,6 +1,7 @@
The MIT License (MIT)

Copyright (c) 2014 Educational Testing Service
Copyright (c) 2014 Educational Testing Service and University of Southern
California

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
@@ -18,4 +19,4 @@ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
SOFTWARE.
13 changes: 12 additions & 1 deletion README.md
@@ -1,12 +1,18 @@

License
=======

This code is licensed under the MIT license (see LICENSE.txt).


Setup
=====

This code requires python 3. I currently use 3.3.5.

This repository is pip-installable. To make it work properly, I recommend running `pip install -e .` to set it up. This will make a local, editable copy in your python environment. See `requirements.txt` for a list of the prerequisite packages. In addition, you may have to install a few NLTK models using `nltk.download()` in python (specifically, punkt and, at least for now, the maxent POS tagger).
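A minimal sketch of that last step in Python, assuming current NLTK resource identifiers (`punkt` and a maxent-based treebank POS tagger); the exact names are assumptions and may differ across NLTK versions:

```python
# Hedged sketch: the resource identifiers below are assumptions,
# not taken from this repository.
import nltk

nltk.download('punkt')                       # sentence tokenizer
nltk.download('maxent_treebank_pos_tagger')  # maxent POS tagger (name varies by NLTK version)
```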

Additionally, the syntactic parsing code must be set up to use ZPar. The simplest but least efficient way is to put the ZPar distribution (version 0.6) in a subdirectory `zpar` (or symbolic link) in the current working directory, along with the English models in a subdirectory `zpar/english`. For efficiency, a better method is to use the `python-zpar` wrapper, which is currently available at `https://bitbucket.org/desilinguist/python-zpar`. To set this up, run make and then either a) set an environment variable `ZPAR_LIBRARY_DIR` equal to the directory where `zpar.so` is created (e.g., `/Users/USER1/python-zpar/dist`) to run ZPar as part of the discourse parser, or b) start a separate server using python-zpar's `zpar_server.py`.
Additionally, the syntactic parsing code must be set up to use ZPar. The simplest but least efficient way is to put the ZPar distribution (version 0.6) in a subdirectory `zpar` (or symbolic link) in the current working directory, along with the English models in a subdirectory `zpar/english`. For efficiency, a better method is to use the `python-zpar` wrapper, which is currently available at `https://github.com/EducationalTestingService/python-zpar` or `https://pypi.python.org/pypi/python-zpar/`. To set this up, run make and then either a) set an environment variable `ZPAR_LIBRARY_DIR` equal to the directory where `zpar.so` is created (e.g., `/Users/USER1/python-zpar/dist`) to run ZPar as part of the discourse parser, or b) start a separate server using python-zpar's `zpar_server`.
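For option a), a minimal sketch of setting the environment variable from Python (the path is a placeholder; exporting the variable in your shell before starting Python is the safer route, since python-zpar may read it at import time):

```python
# Hypothetical placeholder path; point this at the directory containing zpar.so.
import os

os.environ['ZPAR_LIBRARY_DIR'] = '/Users/USER1/python-zpar/dist'
```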

Finally, CRF++ (version 0.58) should be installed, and its `bin` directory should be added to your `PATH` environment variable. See `http://crfpp.googlecode.com/svn/trunk/doc/index.html`.

@@ -69,3 +75,8 @@ rst_eval rst_discourse_tb_edus_TRAINING_DEV.json -p rst_parsing_modelC1.0 --use_
This will compute precision, recall, and F1 scores for 3 scenarios: spans labeled with nuclearity and relation types, spans labeled only with nuclearity, and unlabeled token spans. The above version of the command will use gold standard EDUs and syntactic parses.
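For reference, a hedged sketch of the span-level precision/recall/F1 computation described above, assuming predicted and gold spans are compared as sets of (start, end, label) tuples; this is illustrative and not the project's `rst_eval` implementation:

```python
def span_prf1(predicted, gold):
    '''Precision, recall, and F1 over two collections of labeled spans.'''
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```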

NOTE: The evaluation script has basic functionality in place, but at the moment it almost certainly does not appropriately handle important edge cases (e.g., same-unit relations, relations at the top of the tree). These issues need to be addressed before the script can be used in experiments.

Visualization
=============

The script `util/visualize_rst_tree.py` can be used to create an HTML/javascript visualization, using D3.js (http://d3js.org/). See the D3.js license: `util/LICENSE_d3.txt`. The input to the script is the output of `rst_parse`. See `util/example.json` for an example input.
1 change: 1 addition & 0 deletions discourseparsing/collapse_rst_labels.py
@@ -1,4 +1,5 @@
#!/usr/bin/env python3
# License: MIT

'''
Script to collapse RST discourse treebank relation types,
1 change: 1 addition & 0 deletions discourseparsing/convert_rst_discourse_tb.py
@@ -1,4 +1,5 @@
#!/usr/bin/env python3
# License: MIT

'''
This script merges the RST Discourse Treebank
34 changes: 3 additions & 31 deletions discourseparsing/discourse_parsing.py
@@ -1,36 +1,8 @@
# License: MIT

'''
License
-------
Copyright (c) 2014, Educational Testing Service and Kenji Sagae
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
1. Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR
ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
Description
-----------
This is a python version of a shift-reduce RST discourse parser,
originally written by Kenji Sagae in perl.
This is a python shift-reduce RST discourse parser based partly on a perl
parser written by Kenji Sagae.
'''

import os
85 changes: 39 additions & 46 deletions discourseparsing/discourse_segmentation.py
@@ -1,3 +1,4 @@
# License: MIT

from tempfile import NamedTemporaryFile
import shlex
@@ -26,12 +27,21 @@ def parse_node_features(nodes):

def extract_segmentation_features(doc_dict):
'''
This extracts features for use in the discourse segmentation CRF. Note that
the CRF++ template makes it so that the features for the current word and
2 previous and 2 next words are used for each word.
:param doc_dict: A dictionary of edu_start_indices, tokens, syntax_trees,
token_tree_positions, and pos_tags for a document, as
extracted by convert_rst_discourse_tb.py.
token_tree_positions, and pos_tags for a document, as
extracted by convert_rst_discourse_tb.py.
:returns: a list of lists of lists of features (one feature list per word
per sentence), and a list of lists of labels (one label per word
per sentence)
'''
labels = []
feat_lists = []

labels_doc = []
feat_lists_doc = []

if 'edu_start_indices' in doc_dict:
edu_starts = {(x[0], x[1]) for x in doc_dict['edu_start_indices']}
else:
@@ -43,14 +53,16 @@ def extract_segmentation_features(doc_dict):
doc_dict['syntax_trees'],
doc_dict['token_tree_positions'],
doc_dict['pos_tags'])):

labels_sent = []
feat_lists_sent = []

tree = HeadedParentedTree.fromstring(tree_str)
for token_num, (token, tree_position, pos_tag) \
in enumerate(zip(sent_tokens, sent_tree_positions, pos_tags)):
feats = []
label = 'B-EDU' if (sent_num, token_num) in edu_starts else 'C-EDU'

# TODO: all of the stuff below needs to be checked

# POS tags and words for lexicalized parse nodes
# from 3.2 of Bach et al., 2012.
# preterminal node for the current word
@@ -76,17 +88,18 @@
# now make the list of features
feats.append(token.lower())
feats.append(pos_tag)
feats.append('B-SENT' if token_num == 0 else 'C-SENT')
feats.extend(parse_node_features([node_p,
ancestor_w,
ancestor_r,
node_p_parent,
node_p_right_sibling]))

feat_lists.append(feats)
labels.append(label)
feat_lists_sent.append(feats)
labels_sent.append(label)
feat_lists_doc.append(feat_lists_sent)
labels_doc.append(labels_sent)

return feat_lists, labels
return feat_lists_doc, labels_doc
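To make the nested return structure concrete, a made-up illustration (values invented, feature lists truncated) of the per-sentence, per-word lists that the new code builds:

```python
# Illustrative values only; real feature lists contain more entries
# (lexicalized parse-node features, etc.).
feat_lists_doc = [                      # one entry per sentence
    [['the', 'DT', 'B-SENT'],           # one feature list per word
     ['cat', 'NN', 'C-SENT'],
     ['sat', 'VBD', 'C-SENT']],
]
labels_doc = [                          # parallel per-sentence label lists
    ['B-EDU', 'C-EDU', 'C-EDU'],
]
```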


class Segmenter():
@@ -98,27 +111,24 @@ def segment_document(self, doc_dict):
logging.info('segmenting document, doc_id = {}'.format(doc_id))

# Extract features.
# TODO interact with crf++ via cython, etc.?
tmpfile = NamedTemporaryFile('w')
feat_lists, _ = extract_segmentation_features(doc_dict)
for feat_list in feat_lists:
print('\t'.join(feat_list + ["?"]), file=tmpfile)
feat_lists_doc, _ = extract_segmentation_features(doc_dict)
for feat_lists_sent in feat_lists_doc:
for feat_list_word in feat_lists_sent:
print('\t'.join(feat_list_word + ["?"]), file=tmpfile)
print('\n', file=tmpfile)
tmpfile.flush()

# Get predictions from the CRF++ model.
# TODO interact with crf++ via cython, etc.?
crf_output = subprocess.check_output(shlex.split(
'crf_test -m {} {}'.format(self.model_path, tmpfile.name))) \
.decode('utf-8').strip()
tmpfile.close()

# an index into the list of tokens for this document indicating where
# the current sentence started
sent_start_index = 0

# an index into the list of sentences
sent_num = 0

edu_number = 0
edu_num = 0

# Check that the input is not blank.
all_tokens = doc_dict['tokens']
@@ -128,33 +138,16 @@

# Construct the set of EDU start index tuples (sentence number, token
# number, EDU number).
cur_sent = all_tokens[0]
edu_start_indices = []
for tok_index, line in enumerate(crf_output.split('\n')):
if tok_index - sent_start_index >= len(cur_sent):
sent_start_index += len(cur_sent)
sent_num += 1
cur_sent = all_tokens[sent_num] if sent_num < len(
all_tokens) else None
# Start a new EDU where the CRF predicts "B-EDU".
# Also, force new EDUs to start at the beginnings of sentences to
# account for the rare cases where the CRF does not predict "B-EDU"
# at the beginning of a new sentence (CRF++ can only learn this as
# a soft constraint).
start_of_sentence = (tok_index - sent_start_index == 0)
token_label = line.split()[-1]
if token_label == "B-EDU" or start_of_sentence:
if start_of_sentence and token_label != "B-EDU":
logging.info(("The CRF segmentation model did not" +
" predict B-EDU at the start of a" +
" sentence. A new EDU will be started" +
" regardless, to ensure consistency with" +
" the RST annotations. doc_id = {}")
.format(doc_id))

edu_start_indices.append(
(sent_num, tok_index - sent_start_index, edu_number))
edu_number += 1

for sent_num, crf_output_sent in enumerate(crf_output.split('\n\n')):
for tok_num, line in enumerate(crf_output_sent.split('\n')):
# Start a new EDU where the CRF predicts "B-EDU" and
# at the beginnings of sentences.
token_label = line.split()[-1]
if token_label == "B-EDU" or tok_num == 0:
edu_start_indices.append((sent_num, tok_num, edu_num))
edu_num += 1

# Check that all sentences are covered by the output list of EDUs,
# and that every new sentence starts an EDU.
1 change: 1 addition & 0 deletions discourseparsing/extract_actions_from_trees.py
@@ -1,4 +1,5 @@
#!/usr/bin/env python3
# License: MIT

'''
A script for converting the RST discourse treebank into a gold standard
Expand Down
13 changes: 7 additions & 6 deletions discourseparsing/extract_segmentation_features.py
@@ -1,4 +1,5 @@
#!/usr/bin/env python3
# License: MIT

'''
A discourse segmenter following the Base model from this paper:
@@ -30,12 +31,12 @@ def main():

with open(args.output_path, 'w') as outfile:
for doc in data:
feat_lists, labels = extract_segmentation_features(doc)
for feat_list, label in zip(feat_lists, labels):
print('\t'.join(feat_list + [label]), file=outfile)

print('\t'.join(['' for x in range(len(feat_lists[0]) + 1)]),
file=outfile)
feat_lists_doc, labels_doc = extract_segmentation_features(doc)
for feat_lists_sent, labels_sent in zip(feat_lists_doc, labels_doc):
for feat_list, label in zip(feat_lists_sent, labels_sent):
print('\t'.join(feat_list + [label]), file=outfile)
# blank lines between sentences (and documents)
print(file=outfile)


if __name__ == '__main__':
1 change: 1 addition & 0 deletions discourseparsing/io_util.py
@@ -1,3 +1,4 @@
# License: MIT

import cchardet
import logging
7 changes: 5 additions & 2 deletions discourseparsing/make_segmentation_crfpp_template.py
@@ -1,9 +1,10 @@
#!/usr/bin/env python3
# License: MIT

'''
This generates a feature template file for CRF++.
See http://crfpp.googlecode.com/svn/trunk/doc/index.html.
It will need to be rerun if new features are added to the segmenter.
It needs to be rerun when the segmenter feature set changes.
'''

import argparse
@@ -12,6 +13,8 @@
def make_segmentation_crfpp_template(output_path, num_features=13):
with open(output_path, 'w') as outfile:
for i in range(num_features):
# This makes it so the features for the current word are based
# on the current word and the previous 2 and next 2 words.
for j in [-2, -1, 0, 1, 2]:
print('U{:03d}{}:%x[{},{}]'.format(i, j + 2, j, i),
file=outfile)
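As a concrete illustration of what this loop emits (derived from the format string above, not copied from a generated template file), the lines for feature column 0 would be:

```python
# Reproduces the format call above for feature column 0 (illustrative).
for j in [-2, -1, 0, 1, 2]:
    print('U{:03d}{}:%x[{},{}]'.format(0, j + 2, j, 0))
# U0000:%x[-2,0]
# U0001:%x[-1,0]
# U0002:%x[0,0]
# U0003:%x[1,0]
# U0004:%x[2,0]
```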
@@ -25,7 +28,7 @@ def main():
'--output_path',
help='A path to where the CRF++ template file should be created.',
default='segmentation_crfpp_template.txt')
parser.add_argument('--num_features', type=int, default=13)
parser.add_argument('--num_features', type=int, default=12)
args = parser.parse_args()
make_segmentation_crfpp_template(args.output_path, args.num_features)

1 change: 1 addition & 0 deletions discourseparsing/make_traindev_split.py
@@ -1,4 +1,5 @@
#!/usr/bin/env python3
# License: MIT

'''
A script to split up the official RST discourse treebank training set into a
1 change: 1 addition & 0 deletions discourseparsing/paragraph_splitting.py
@@ -1,4 +1,5 @@
#!/usr/bin/env python3
# License: MIT

import logging
import re
1 change: 1 addition & 0 deletions discourseparsing/parse_util.py
@@ -1,3 +1,4 @@
# License: MIT

import ctypes as c
import socket