CONLL 2000 CHUNKING DATA
http://cnts.uia.ac.be/conll2000/chunking/
Erik Tjong Kim Sang <erikt@uia.ua.ac.be>

Text chunking consists of dividing a text into syntactically correlated
groups of words. For example, the sentence "He reckons the current
account deficit will narrow to only # 1.8 billion in September ." can
be divided as follows:

    [NP He ] [VP reckons ] [NP the current account deficit ] [VP will
    narrow ] [PP to ] [NP only # 1.8 billion ] [PP in ] [NP September ] . 

Text chunking is an intermediate step towards full parsing. It was the
shared task for CoNLL-2000[http://cnts.uia.ac.be/conll2000/]. Training
and test data for this task are available. The data consist of the
same partitions of the Wall Street Journal corpus (WSJ) as the widely
used data for noun phrase chunking: sections 15-18 as training data
(211727 tokens) and section 20 as test data (47377 tokens). The
annotation of the data has been derived from the WSJ corpus by a
program written by Sabine Buchholz from Tilburg University, The
Netherlands.

The goal of this task is to develop machine learning methods which,
after a training phase, can recognize the chunk segmentation of the
test data as well as possible. The training data can be used for
training the text chunker. The chunkers will be evaluated with the F
rate, which is the harmonic mean of the precision and recall rates: F =
2*precision*recall / (recall+precision) [Rij79]. The precision and
recall numbers will be computed over all types of chunks.
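
The evaluation measure can be illustrated with a short Python sketch.
It assumes that a predicted chunk counts as correct only when both its
boundaries and its type match a chunk in the test data. The function
name and the counts in the usage line are hypothetical and are not taken
from the task software.

```python
def chunk_scores(num_correct, num_predicted, num_gold):
    """Return (precision, recall, F) computed from chunk counts.

    num_correct:   predicted chunks that also occur in the test data
    num_predicted: all chunks proposed by the chunker
    num_gold:      all chunks in the test data
    """
    precision = num_correct / num_predicted if num_predicted else 0.0
    recall = num_correct / num_gold if num_gold else 0.0
    total = precision + recall
    # F = 2*precision*recall / (recall+precision), defined as 0 when both are 0.
    f_rate = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_rate


# Hypothetical counts, for illustration only.
precision, recall, f_rate = chunk_scores(8, 10, 16)
print(f"precision={precision:.2%} recall={recall:.2%} F={100 * f_rate:.2f}")
```

The counts are summed over all chunk types before the rates are
computed, which is one way to read "computed over all types of chunks".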

Background Information

In 1991, Steven Abney proposed approaching parsing by first finding
correlated chunks of words [Abn91]. Lance Ramshaw and Mitch Marcus
approached chunking with a machine learning method [RM95]. Their work
has inspired many others to study the application of learning methods
to noun phrase chunking
[http://lcg-www.uia.ac.be/~erikt/research/np-chunking.html]. Other
chunk types have not received the same attention as NP chunks. The
most complete work is [BVD99], which presents results for NP, VP, PP,
ADJP and ADVP chunks. [Vee99] works with NP, VP and PP chunks. [RM95]
recognized arbitrary chunks but classified every non-NP chunk as a
VP chunk. [Rat98] recognized arbitrary chunks as part of a parsing
task but did not report on the chunking performance.

Software and Data

The training and test data files consist of three columns separated by
spaces. Each word has been put on a separate line and there is an empty
line after each sentence. The first column contains the current word,
the second its part-of-speech tag as derived by the Brill tagger and
the third its chunk tag as derived from the WSJ corpus. The chunk tags
contain the name of the chunk type, for example I-NP for noun phrase
words and I-VP for verb phrase words. Most chunk types have two kinds
of chunk tags: B-CHUNK for the first word of the chunk and I-CHUNK for
each other word in the chunk. Here is an example of the file format:

   He        PRP  B-NP
   reckons   VBZ  B-VP
   the       DT   B-NP
   current   JJ   I-NP
   account   NN   I-NP
   deficit   NN   I-NP
   will      MD   B-VP
   narrow    VB   I-VP
   to        TO   B-PP
   only      RB   B-NP
   #         #    I-NP
   1.8       CD   I-NP
   billion   CD   I-NP
   in        IN   B-PP
   September NNP  B-NP
   .         .    O
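
A minimal Python sketch of a reader for this format is given here. The
function name read_sentences is hypothetical. The sketch assumes
whitespace-separated columns and an empty line after each sentence, and
it raises a ValueError on any non-empty line that does not have exactly
three columns.

```python
def read_sentences(lines):
    """Yield sentences as lists of (word, pos_tag, chunk_tag) tuples."""
    sentence = []
    for line in lines:
        fields = line.split()
        if not fields:
            # An empty line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        word, pos_tag, chunk_tag = fields
        sentence.append((word, pos_tag, chunk_tag))
    if sentence:
        # The input may lack a trailing empty line.
        yield sentence


sample = """He PRP B-NP
reckons VBZ B-VP

. . O
"""
for sentence in read_sentences(sample.splitlines()):
    print(sentence)
```

For the compressed files, an open text-mode handle such as
gzip.open("train.txt.gz", "rt") can be passed in place of the list of
lines.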

The O chunk tag is used for tokens which are not part of any chunk.
Instead of using the part-of-speech tags of the WSJ corpus, the data set
uses tags generated by the Brill tagger. Performance with the corpus
tags would be better, but it would be unrealistic, since no perfect
part-of-speech tags will be available for novel text.
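
The relation between the chunk tags and the bracketed chunks can be
sketched in Python as follows. The function name extract_chunks is
hypothetical. The sketch assumes that a chunk ends at an O tag, at any
B- tag, or where the chunk type changes, and that an I- tag which does
not continue a chunk of the same type starts a new chunk.

```python
def extract_chunks(chunk_tags):
    """Return (chunk_type, start, end) spans; end is exclusive."""
    chunks = []
    current_type, start = None, None
    for i, tag in enumerate(chunk_tags):
        # "B-NP" -> ("B", "-", "NP"); "O" -> ("O", "", "")
        prefix, _, chunk_type = tag.partition("-")
        # Close the open chunk unless this tag continues it.
        if current_type is not None and (
            prefix != "I" or chunk_type != current_type
        ):
            chunks.append((current_type, start, i))
            current_type = None
        # Open a new chunk at B-, or at an I- with no open chunk.
        if prefix in ("B", "I") and current_type is None:
            current_type, start = chunk_type, i
    if current_type is not None:
        chunks.append((current_type, start, len(chunk_tags)))
    return chunks


# Chunk tags of the example sentence in the file format shown earlier.
tags = ["B-NP", "B-VP", "B-NP", "I-NP", "I-NP", "I-NP", "B-VP", "I-VP",
        "B-PP", "B-NP", "I-NP", "I-NP", "I-NP", "B-PP", "B-NP", "O"]
for chunk in extract_chunks(tags):
    print(chunk)
```

Comparing the span sets obtained from the predicted tags and from the
test data tags gives the chunk counts needed for precision and recall.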

    * http://lcg-www.uia.ac.be/conll2000/chunking/train.txt.gz
      http://lcg-www.uia.ac.be/conll2000/chunking/test.txt.gz
      The train and test data for this task. The first two columns have
      been extracted from the [RM95] NP chunking data which is available
      from: ftp://ftp.cis.upenn.edu/pub/chunker/
    * http://ilk.kub.nl/~sabine/chunklink/
      The Perl script that was used for generating these training and
      test data sets from the Penn Treebank. It has been written by
      Sabine Buchholz from Tilburg University.
    * http://lcg-www.uia.ac.be/conll2000/chunking/conlleval.txt
      A Perl script for performance measuring. An output example
      <output.html> is available for this evaluation software.


Results

Eleven systems have been applied to the CoNLL-2000 shared task. The
systems used a wide variety of techniques. Here is an overview of the
performance of these 11 systems on the test set together with other
results (*) on this data set published after the workshop:

              +-----------+-----------++-----------++
              | precision |   recall  ||     F     ||
   +----------+-----------+-----------++-----------++
   | [ZDJ01]  |   94.29%  |   94.01%  ||   94.13   || (*)
   | [KM01]   |   93.89%  |   93.92%  ||   93.91   || (*)
   | [KM00]   |   93.45%  |   93.51%  ||   93.48   ||
   | [Hal00]  |   93.13%  |   93.51%  ||   93.32   ||
   | [TKS00]  |   94.04%  |   91.00%  ||   92.50   ||
   | [ZST00]  |   91.99%  |   92.25%  ||   92.12   ||
   | [Dej00]  |   91.87%  |   92.31%  ||   92.09   ||
   | [Koe00]  |   92.08%  |   91.86%  ||   91.97   ||
   | [Osb00]  |   91.65%  |   92.23%  ||   91.94   ||
   | [VB00]   |   91.05%  |   92.03%  ||   91.54   ||
   | [PMP00]  |   90.63%  |   89.65%  ||   90.14   ||
   | [Joh00]  |   86.24%  |   88.25%  ||   87.23   ||
   | [VD00]   |   88.82%  |   82.91%  ||   85.76   ||
   +----------+-----------+-----------++-----------++
   | baseline |   72.58%  |   82.14%  ||   77.07   ||
   +----------+-----------+-----------++-----------++

The baseline result was obtained by selecting the chunk tag which was
most frequently associated with the current part-of-speech tag. At the
workshop, all 11 systems outperformed the baseline. Most of them (six of
the eleven) obtained an F-score between 91.5 and 92.5. Two systems
performed clearly better, with F-scores above 93: Support Vector
Machines used by Kudoh and Matsumoto [KM00] and Weighted Probability
Distribution Voting used by Van Halteren [Hal00]. The papers associated
with the participating systems can be found in the reference section
below.
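
The baseline described above can be sketched in Python as follows. This
is an illustration and not the code that produced the baseline row of
the table. The function names are hypothetical, and assigning O to
part-of-speech tags that do not occur in the training data is an
assumption of the sketch.

```python
from collections import Counter, defaultdict


def train_baseline(sentences):
    """Map each part-of-speech tag to its most frequent chunk tag.

    sentences: iterable of lists of (word, pos_tag, chunk_tag) tuples.
    """
    counts = defaultdict(Counter)
    for sentence in sentences:
        for _, pos_tag, chunk_tag in sentence:
            counts[pos_tag][chunk_tag] += 1
    return {pos: c.most_common(1)[0][0] for pos, c in counts.items()}


def apply_baseline(model, pos_tags, default="O"):
    """Predict one chunk tag per part-of-speech tag."""
    # Unseen part-of-speech tags get the default tag (an assumption).
    return [model.get(pos, default) for pos in pos_tags]


training = [[
    ("He", "PRP", "B-NP"), ("reckons", "VBZ", "B-VP"),
    ("the", "DT", "B-NP"), ("current", "JJ", "I-NP"),
    ("account", "NN", "I-NP"), ("deficit", "NN", "I-NP"),
]]
model = train_baseline(training)
print(apply_baseline(model, ["DT", "JJ", "NN", "VBZ"]))
```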


Related information

    * http://lcg-www.uia.ac.be/conll2000/
      Home page of the workshop on Computational Natural Language
      Learning (CoNLL-2000)
    * http://lcg-www.uia.ac.be/~erikt/research/np-chunking.html
      Information on NP chunking.
    * http://lcg-www.uia.ac.be/lcg/
      Home page of the TMR network - Learning Computational Grammars.
    * http://ilk.kub.nl/cgi-bin/chunkdemo/demo.pl
      A demo from Tilburg University of a set of memory-based learning
      programs that perform tagging, chunking and detection of subjects
      and objects.


References

This reference section contains two parts: first the papers from the
shared task session at CoNLL-2000 and then the other related publications.

      CoNLL-2000 Shared Task Papers

[TB00]
      Erik F. Tjong Kim Sang and Sabine Buchholz, Introduction to the
      CoNLL-2000 Shared Task: Chunking. In: Proceedings of CoNLL-2000
      and LLL-2000, Lisbon, Portugal, 2000.
      [abstract <../abstracts/12732tjo.html>] [ps <../ps/12732tjo.ps>]
      [pdf <../pdf/12732tjo.pdf>]
[Dej00]
      Hervé Déjean, Learning Syntactic Structures with XML. In:
      Proceedings of CoNLL-2000 and LLL-2000, Lisbon, Portugal, 2000.
      [ps <../ps/13335dej.ps>] [pdf <../pdf/13335dej.pdf>] [test data
      output <results/13335dej.txt>]
[Joh00]
      Christer Johansson, A Context Sensitive Maximum Likelihood
      Approach to Chunking. In: Proceedings of CoNLL-2000 and LLL-2000,
      Lisbon, Portugal, 2000.
      [ps <../ps/13638joh.ps>] [pdf <../pdf/13638joh.pdf>] [test data
      output <results/13638joh.txt>]
[Koe00]
      Rob Koeling, Chunking with Maximum Entropy Models. In: Proceedings
      of CoNLL-2000 and LLL-2000, Lisbon, Portugal, 2000.
      [ps <../ps/13941koe.ps>] [pdf <../pdf/13941koe.pdf>] [test data
      output <results/13941koe.txt>]
[KM00]
      Taku Kudoh and Yuji Matsumoto, Use of Support Vector Learning for
      Chunk Identification. In: Proceedings of CoNLL-2000 and LLL-2000,
      Lisbon, Portugal, 2000.
      [ps <../ps/14244kud.ps>] [pdf <../pdf/14244kud.pdf>] [test data
      output <results/14244kud.txt>]
[Osb00]
      Miles Osborne, Shallow Parsing as Part-of-Speech Tagging. In:
      Proceedings of CoNLL-2000 and LLL-2000, Lisbon, Portugal, 2000.
      [abstract <../abstracts/14547osb.html>] [ps <../ps/14547osb.ps>]
      [pdf <../pdf/14547osb.pdf>] [test data output <results/14547osb.txt>]
[PMP00]
      Ferran Pla, Antonio Molina and Natividad Prieto, Improving
      Chunking by Means of Lexical-Contextual Information in Statistical
      Language Models. In: Proceedings of CoNLL-2000 and LLL-2000,
      Lisbon, Portugal, 2000.
      [ps <../ps/14850pla.ps>] [pdf <../pdf/14850pla.pdf>] [test data
      output <results/14850pla.txt>]
[TKS00]
      Erik F. Tjong Kim Sang, Text Chunking by System Combination. In:
      Proceedings of CoNLL-2000 and LLL-2000, Lisbon, Portugal, 2000.
      [ps <../ps/15153tjo.ps>] [pdf <../pdf/15153tjo.pdf>] [test data
      output <results/15153tjo.txt>]
[Hal00]
      Hans van Halteren, Chunking with WPDV Models. In: Proceedings of
      CoNLL-2000 and LLL-2000, Lisbon, Portugal, 2000.
      [ps <../ps/15456van.ps>] [pdf <../pdf/15456van.pdf>] [test data
      output <results/15456van.txt>]
[VB00]
      Jorn Veenstra and Antal van den Bosch, Single-Classifier
      Memory-Based Phrase Chunking. In: Proceedings of CoNLL-2000 and
      LLL-2000, Lisbon, Portugal, 2000.
      [ps <../ps/15759vee.ps>] [pdf <../pdf/15759vee.pdf>] [test data
      output <results/15759vee.txt>]
[VD00]
      Marc Vilain and David Day, Phrase Parsing with Rule Sequence
      Processors: an Application to the Shared CoNLL Task. In:
      Proceedings of CoNLL-2000 and LLL-2000, Lisbon, Portugal, 2000.
      [ps <../ps/16062vil.ps>] [pdf <../pdf/16062vil.pdf>] [test data
      output <results/16062vil.txt>]
[ZST00]
      GuoDong Zhou, Jian Su and TongGuan Tey, Hybrid Text Chunking. In:
      Proceedings of CoNLL-2000 and LLL-2000, Lisbon, Portugal, 2000.
      [abstract <../abstracts/16366zho.html>] [ps <../ps/16366zho.ps>]
      [pdf <../pdf/16366zho.pdf>] [test data output <results/16366zho.txt>]


      Other related publications

[Abn91]
      Steven Abney, Parsing By Chunks. In: Robert Berwick and Steven
      Abney and Carol Tenny, "Principle-Based Parsing", Kluwer Academic
      Publishers, 1991.
      http://whorf.sfs.nphil.uni-tuebingen.de/~abney/Abney_90e.ps.gz
[Bel01]
      Anja Belz, Optimisation of corpus-derived probabilistic grammars.
      In: "Corpus Linguistics 2001", Lancaster, UK, 2001.
      http://lcg-www.uia.ac.be/lcg/ps/belz.cl2001.ps.gz
[BVD99]
      Sabine Buchholz, Jorn Veenstra and Walter Daelemans, Cascaded
      Grammatical Relation Assignment. In: "Proceedings of
      EMNLP/VLC-99", University of Maryland, USA, 1999.
      ftp://ilk.kub.nl/pub/papers/ilk.9908.ps.gz
[Dej02]
      Hervé Déjean, Learning Rules and Their Exceptions. In: Journal of
      Machine Learning Research, volume 2 (March), 2002, pp. 669-693.
      http://www.ai.mit.edu/projects/jmlr/papers/volume2/dejean02a/dejean02a.pdf
[FHN00]
      Radu Florian, John C. Henderson and Grace Ngai, Coaxing
      Confidences from an Old Friend: Probabilistic Classifications from
      Transformation Rule Lists. In: "Proceedings of EMNLP 2000", Hong
      Kong, 2000.
      http://arXiv.org/ps/cs/0104020
[KM01]
      Taku Kudoh and Yuji Matsumoto, Chunking with Support Vector
      Machines. In: "Proceedings of NAACL 2001", Pittsburgh, PA, USA, 2001.
      http://cactus.aist-nara.ac.jp/~taku-ku/publication/naacl2001.ps
[Meg02]
      Beáta Megyesi, Shallow Parsing with PoS Taggers and Linguistic
      Features. In: Journal of Machine Learning Research, volume 2
      (March), 2002, pp. 639-668.
      http://www.ai.mit.edu/projects/jmlr/papers/volume2/megyesi02a/megyesi02a.pdf
[MP02]
      Antonio Molina and Ferran Pla, Shallow Parsing using Specialized
      HMMs. In: Journal of Machine Learning Research, volume 2 (March),
      2002, pp. 595-613.
      http://www.ai.mit.edu/projects/jmlr/papers/volume2/molina02a/molina02a.pdf
[NF01]
      Grace Ngai and Radu Florian, Transformation Based Learning in the
      Fast Lane. In: "Proceedings of NAACL 2001", Pittsburgh, PA, USA, 2001.
      http://nlp.cs.jhu.edu/~rflorian/papers/naacl01.ps
[Osb02]
      Miles Osborne, Shallow Parsing using Noisy and Non-Stationary
      Training Material. In: Journal of Machine Learning Research, volume
      2 (March), 2002, pp. 695-719.
      http://www.ai.mit.edu/projects/jmlr/papers/volume2/osborne02a/osborne02a.pdf
[RM95]
      Lance A. Ramshaw and Mitchell P. Marcus, Text Chunking Using
      Transformation-Based Learning. In: "Proceedings of the Third ACL
      Workshop on Very Large Corpora", Cambridge MA, USA, 1995.
      ftp://ftp.cis.upenn.edu/pub/chunker/wvlcbook.ps.gz
[Rat98]
      Adwait Ratnaparkhi, "Maximum Entropy Models for Natural Language
      Ambiguity Resolution". PhD thesis, University of Pennsylvania, 1998.
      ftp://ftp.cis.upenn.edu/pub/ircs/tr/98-15/98-15.ps.gz
[Rij79]
      C.J. van Rijsbergen, "Information Retrieval". Butterworths, 1979.
[TKS02]
      Erik F. Tjong Kim Sang, Memory-Based Shallow Parsing. In: Journal
      of Machine Learning Research, volume 2 (March), 2002, pp. 559-594.
      http://arXiv.org/abs/cs.CL/0204049
[Vee99]
      Jorn Veenstra, Memory-Based Text Chunking. In: Nikos Fakotakis
      (ed), "Machine learning in human language technology", workshop at
      ACAI 99, Chania, Greece, 1999.
      http://ilk.kub.nl/~ilk/papers/ACAI.ps
[ZDJ01]
      Tong Zhang, Fred Damerau and David Johnson, Text Chunking using
      Regularized Winnow. In: Proceedings of ACL-2001, Toulouse, France,
      2001.
[ZDJ02]
      Tong Zhang, Fred Damerau and David Johnson, Text Chunking based on
      a Generalization of Winnow. In: Journal of Machine Learning
      Research, volume 2 (March), 2002, pp. 615-637.
      http://www.ai.mit.edu/projects/jmlr/papers/volume2/zhang02c/zhang02c.pdf

