===============
Dataset format
===============

We provide the 1000 most cited papers in the ACL Anthology Network (AAN) (Radev et al., 2013) with their annotated citation information and gold summaries.

The `top1000_complete` directory contains a subdirectory for each of the reference papers (RPs), denoted by the ACL Anthology paper id. Each such subdirectory contains:

- Documents_xml/ : containing the XML file of the RP
- citing_sentences_annotated.json : the incoming citing sentences of the RP (details provided below)
- summary/ : containing the annotated gold summary
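
A minimal Python sketch of traversing this layout is shown below. The root path assumes the release is unpacked so that `top1000_complete` is directly accessible, and the JSON file is assumed to hold a list of citing-sentence records:

    import json
    from pathlib import Path

    root = Path("top1000_complete")   # adjust to where the release is unpacked

    for rp_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        paper_id = rp_dir.name                                        # ACL Anthology id of the RP
        xml_path = next((rp_dir / "Documents_xml").glob("*.xml"))     # full text of the RP
        summary_paths = list((rp_dir / "summary").glob("*"))          # annotated gold summary
        with open(rp_dir / "citing_sentences_annotated.json", encoding="utf-8") as f:
            citing_sentences = json.load(f)                           # assumed: list of dicts
        print(paper_id, len(citing_sentences), "citing sentences")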


================================
citing_sentences_annotated.json
================================

For each citing sentence, we provide:

- "citing_paper_id": the ACL Anthology paper id of the citing paper (CP)
- "raw_text": the raw text of the citing sentence, as extracted from the AAN
- "clean_text": cleaned citing sentence. if empty, that means the citing sentence is broken (e.g., citing a wrong paper) or inappropriate (a table or list citation)
- "keep_for_gold": whether or not to consider this citing sentence in gold summary

"clean_text" and "keep_for_gold" were annotated by humans with good background knowledge in the computational linguistics domain.

We also added the following information for each citing sentence after the annotation was completed (so it was not available to the annotators):

- "citing_paper_authority": authority score of the CP (i.e., citation count of the CP)
- "citing_paper_authors": authors of the CP

This information can be useful for additional analyses of the citing sentences.
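
For instance, the usable citing sentences of an RP can be ranked by the authority of their citing papers. The sketch below continues from the previous one and assumes "citing_paper_authority" holds a (possibly string-encoded) citation count:

    # Rank usable citing sentences by the citation count of their citing papers.
    by_authority = sorted(usable,
                          key=lambda c: int(c.get("citing_paper_authority", 0)),
                          reverse=True)

    for c in by_authority[:5]:
        print(c["citing_paper_authority"], c["citing_paper_authors"], c["clean_text"][:80])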



===================
Dataset statistics
===================

Number of RPs: 1009
Citation counts of RPs: 21 - 928 within AAN (average: 61)

For each RP,
Length of total text (average): 4417 words
Length of abstract (average): 110 words
Number of sampled citing sentences (average): 15
Number of citing sentences selected for summary (average): 2
Inter-annotator agreement (Kappa): 0.75
Length of gold summary (average): 151 words
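
The per-RP citing-sentence figures above can be recomputed directly from the released files; a rough sketch, reusing the directory layout and JSON assumptions from the earlier examples:

    import json
    from pathlib import Path

    root = Path("top1000_complete")   # adjust to where the release is unpacked
    sampled, selected = [], []

    for rp_dir in (p for p in root.iterdir() if p.is_dir()):
        with open(rp_dir / "citing_sentences_annotated.json", encoding="utf-8") as f:
            citances = json.load(f)
        sampled.append(len(citances))
        selected.append(sum(1 for c in citances if c.get("keep_for_gold")))

    print("avg sampled citing sentences per RP:", sum(sampled) / len(sampled))
    print("avg citing sentences selected for summary:", sum(selected) / len(selected))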


==============
Usage example
==============

Given a reference paper (RP) to summarize, its XML file (the text of the RP) and JSON file (the incoming citing sentences) can be used as the input, and the annotated summary can be used as the comprehensive gold summary to train on or to evaluate against.

Example: in our submitted paper, given an RP to summarize, we take its abstract (from the XML file) and all of the cleaned citing sentences and authority scores (from the JSON file) as the input. We then train summarization models on the annotated summaries so that the models learn to extract summary-worthy sentences. Please refer to our paper for details of the proposed summarization models.
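
A sketch of assembling that input for a single RP is given below. The `ABSTRACT`/`S` tag names used to pull the abstract out of the XML are an assumption about the document markup (not specified in this documentation), and `<paper_id>` is a placeholder:

    import json
    import xml.etree.ElementTree as ET
    from pathlib import Path

    rp_dir = Path("top1000_complete") / "<paper_id>"   # placeholder RP directory

    # Abstract sentences from the RP's XML file (tag names are an assumption).
    xml_path = next((rp_dir / "Documents_xml").glob("*.xml"))
    tree = ET.parse(xml_path)
    abstract = [s.text or "" for node in tree.iter("ABSTRACT") for s in node.iter("S")]

    # Cleaned citing sentences and authority scores from the JSON file.
    with open(rp_dir / "citing_sentences_annotated.json", encoding="utf-8") as f:
        citances = json.load(f)
    citing = [(c["clean_text"], c.get("citing_paper_authority"))
              for c in citances if c.get("clean_text")]

    # Annotated gold summary, used as the training/evaluation target.
    gold_summary = next((rp_dir / "summary").glob("*")).read_text(encoding="utf-8")

    model_input = abstract + [text for text, _ in citing]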


