This is the Citation Function Classification corpus [CFC corpus],
created by Simone Teufel, Advaith Siddharthan and Dan Tidhar. It
consists of 161 CFC-annotated conference articles in computational
linguistics, originally drawn from the Cmplg arXiv.  The corpus is
distributed under the Creative Commons Attribution-NonCommercial 2.0
UK:England and Wales Licence (CC BY-NC 2.0 UK).

The citation function classification experiments we performed with the
corpus were first published as

* S.Teufel, A.Siddharthan, D.Tidhar. Automatic classification of citation
function. In: "Proceedings of EMNLP-06", Sydney, Australia, 2006.

The corpus and its creation are described in detail in the book

* S.Teufel, The structure of scientific articles: Applications to
Indexing and Summarization, CSLI Publications, 2010.

My coauthors and I would appreciate a courtesy email if and when you
download the CFC corpus, and definitely when you publish new research
using this corpus.

The corpus is in the SciXML format created by Simone Teufel and also
defined in the above book. A DTD for this corpus, paper-structure.dtd,
which includes definitions for citation function annotation, is
included in each of the two data directories. The data files
accordingly have the extension "cfc-scixml".

Citation function is encoded as one of 12 categories, in the XML
attribute "CFunc" that sits on REF XML elements (in-line
references). Annotators were encouraged to give a reason for the class
they assigned. This is encoded in the "LinkS" XML attribute.
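
As an illustration of how these attributes might be read, here is a
minimal Python sketch using only the standard library. The element name
REF and the attribute names CFunc and LinkS come from the description
above; everything else is an assumption made for the example: the
miniature document, the surrounding PAPER/BODY/P/S elements, the
placeholder labels LABEL_A and LABEL_B (not the real category names),
and the helper name citation_functions. Real corpus files may need
extra care, for instance with character encodings or entities declared
in the DTD.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical miniature document. Only REF, CFunc and LinkS are taken
# from the README; the other elements and the label values are made up.
SAMPLE = """<PAPER>
<BODY>
<P><S>We build on <REF CFunc="LABEL_A" LinkS="uses their parser">Smith 1990</REF>.</S>
<S>See also <REF CFunc="LABEL_B">Jones 1992</REF> and <REF CFunc="LABEL_A">Lee 1993</REF>.</S></P>
</BODY>
</PAPER>"""


def citation_functions(root):
    """Yield (CFunc, LinkS, citation text) for each REF that has a CFunc."""
    for ref in root.iter("REF"):
        cfunc = ref.get("CFunc")
        if cfunc is not None:
            # LinkS is optional, so ref.get() may return None here.
            yield cfunc, ref.get("LinkS"), "".join(ref.itertext()).strip()


root = ET.fromstring(SAMPLE)
records = list(citation_functions(root))

# Frequency of each citation function label in the document.
counts = Counter(cfunc for cfunc, _, _ in records)
```

To process a file from the corpus rather than a string, the same helper
could be called on ET.parse(path).getroot().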

The corpus comes in two parts. 

  * 2006_paper_training -- 116 files used in the above-mentioned
    paper. They are a mix of production-mode annotated files (one
    annotator only), annotation files as given on page 233 of the
    book, and re-annotated development files (cf. p233ff).

  * additional -- shortly after publication of the paper, we annotated
    45 additional files from the Cmplg archive. If all 161 files are
    used in replication experiments, the resulting numbers will
    therefore not be directly comparable to our published results.

You should see the following when you untar:

README

2006_paper_training:
9405001.cfc-scixml  9412008.cfc-scixml	9504002.cfc-scixml  9703002.cfc-scixml
9405002.cfc-scixml  9502004.cfc-scixml	9504006.cfc-scixml  9704002.cfc-scixml
9405004.cfc-scixml  9502005.cfc-scixml	9504007.cfc-scixml  9704008.cfc-scixml
9405010.cfc-scixml  9502006.cfc-scixml	9504017.cfc-scixml  9706013.cfc-scixml
9405013.cfc-scixml  9502009.cfc-scixml	9504024.cfc-scixml  9707009.cfc-scixml
9405022.cfc-scixml  9502014.cfc-scixml	9504026.cfc-scixml  9711010.cfc-scixml
9405023.cfc-scixml  9502015.cfc-scixml	9504027.cfc-scixml  9806001.cfc-scixml
9405028.cfc-scixml  9502018.cfc-scixml	9504030.cfc-scixml  9806019.cfc-scixml
9405033.cfc-scixml  9502021.cfc-scixml	9504033.cfc-scixml  9807001.cfc-scixml
9405035.cfc-scixml  9502022.cfc-scixml	9504034.cfc-scixml  9808008.cfc-scixml
9407011.cfc-scixml  9502023.cfc-scixml	9505001.cfc-scixml  9808009.cfc-scixml
9408003.cfc-scixml  9502024.cfc-scixml	9505011.cfc-scixml  9808012.cfc-scixml
9408004.cfc-scixml  9502031.cfc-scixml	9505024.cfc-scixml  9809027.cfc-scixml
9408006.cfc-scixml  9502033.cfc-scixml	9506004.cfc-scixml  9809106.cfc-scixml
9408011.cfc-scixml  9502035.cfc-scixml	9506017.cfc-scixml  9809112.cfc-scixml
9408014.cfc-scixml  9502037.cfc-scixml	9508005.cfc-scixml  9810015.cfc-scixml
9409004.cfc-scixml  9502038.cfc-scixml	9511001.cfc-scixml  9811009.cfc-scixml
9410001.cfc-scixml  9502039.cfc-scixml	9511006.cfc-scixml  9902001.cfc-scixml
9410005.cfc-scixml  9503002.cfc-scixml	9601004.cfc-scixml  9904008.cfc-scixml
9410006.cfc-scixml  9503004.cfc-scixml	9604019.cfc-scixml  9905001.cfc-scixml
9410008.cfc-scixml  9503005.cfc-scixml	9604022.cfc-scixml  9905008.cfc-scixml
9410009.cfc-scixml  9503007.cfc-scixml	9605013.cfc-scixml  9905009.cfc-scixml
9410012.cfc-scixml  9503009.cfc-scixml	9605014.cfc-scixml  9906004.cfc-scixml
9410022.cfc-scixml  9503013.cfc-scixml	9605016.cfc-scixml  9907006.cfc-scixml
9410032.cfc-scixml  9503014.cfc-scixml	9605023.cfc-scixml  9907007.cfc-scixml
9410033.cfc-scixml  9503015.cfc-scixml	9606028.cfc-scixml  9907010.cfc-scixml
9411019.cfc-scixml  9503017.cfc-scixml	9606031.cfc-scixml  paper-structure.dtd
9411021.cfc-scixml  9503018.cfc-scixml	9607001.cfc-scixml
9411023.cfc-scixml  9503023.cfc-scixml	9607019.cfc-scixml
9412005.cfc-scixml  9503025.cfc-scixml	9702002.cfc-scixml

additional:
0001012.cfc-scixml  0006028.cfc-scixml	0008023.cfc-scixml  0011020.cfc-scixml
0003055.cfc-scixml  0006038.cfc-scixml	0008024.cfc-scixml  0102019.cfc-scixml
0003060.cfc-scixml  0006044.cfc-scixml	0008026.cfc-scixml  0102020.cfc-scixml
0003083.cfc-scixml  0007035.cfc-scixml	0008027.cfc-scixml  9407001.cfc-scixml
0005006.cfc-scixml  0008004.cfc-scixml	0008028.cfc-scixml  9907003.cfc-scixml
0005015.cfc-scixml  0008005.cfc-scixml	0008029.cfc-scixml  9907013.cfc-scixml
0005016.cfc-scixml  0008012.cfc-scixml	0008034.cfc-scixml  9912003.cfc-scixml
0005025.cfc-scixml  0008016.cfc-scixml	0008035.cfc-scixml  9912004.cfc-scixml
0006003.cfc-scixml  0008017.cfc-scixml	0009027.cfc-scixml  9912005.cfc-scixml
0006011.cfc-scixml  0008020.cfc-scixml	0010020.cfc-scixml  paper-structure.dtd
0006019.cfc-scixml  0008021.cfc-scixml	0011001.cfc-scixml
0006021.cfc-scixml  0008022.cfc-scixml	0011007.cfc-scixml
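
One way to check an unpacked copy against the listing above is to count
the data files per directory. The following Python sketch does this. The
directory names, the "cfc-scixml" extension and the expected counts (116
and 45) are taken from this README; the function names and the
corpus_root argument are hypothetical and only serve the example.

```python
from pathlib import Path

# Expected number of data files per directory, as stated in this README.
EXPECTED = {"2006_paper_training": 116, "additional": 45}


def count_corpus_files(corpus_root):
    """Count the *.cfc-scixml files in each of the two data directories."""
    root = Path(corpus_root)
    return {
        name: len(list((root / name).glob("*.cfc-scixml")))
        for name in EXPECTED
    }


def check_corpus(corpus_root):
    """Return one message per directory whose file count is unexpected."""
    counts = count_corpus_files(corpus_root)
    return [
        f"{name}: expected {EXPECTED[name]}, found {counts[name]}"
        for name in EXPECTED
        if counts[name] != EXPECTED[name]
    ]
```

Calling check_corpus on the directory into which the archive was
unpacked should return an empty list if both directories are complete.
The paper-structure.dtd files are not counted because they do not match
the "*.cfc-scixml" pattern.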


Simone Teufel, August 2014

