This is the Citation Function Classification corpus [CFC corpus],
created by Simone Teufel, Advaith Siddharthan and Dan Tidhar. It
consists of 161 CFC-annotated conference articles in computational
linguistics, originally drawn from the cmp-lg arXiv. The corpus is
distributed under the Creative Commons Attribution-NonCommercial 2.0
UK:England and Wales Licence (CC BY-NC 2.0 UK).
The citation function classification experiments we performed with the
corpus were first published as
- S. Teufel, A. Siddharthan, D. Tidhar. Automatic classification of
  citation function. In: "Proceedings of EMNLP-06", Sydney, Australia, 2006.
The corpus and its creation are described in detail in the book
- S. Teufel. The structure of scientific articles: Applications to
  Indexing and Summarization. CSLI Publications, 2010.
My coauthors and I would appreciate a courtesy email if and when you
download the CFC corpus, and definitely when you publish new research
using this corpus.
The corpus is in the Scixml format created by Simone Teufel and
defined in the above book. A DTD for this corpus, paper-structure.dtd,
which includes definitions for citation function annotation, is given
in each of the two data directories. The data files accordingly have
the extension "cfc-scixml".
Citation function is encoded as one of 12 categories, in the XML
attribute "CFunc" that sits on REF XML elements (in-line
references). Annotators were encouraged to give a reason for the class
they assigned. This is encoded in the "LinkS" XML attribute.
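
As an illustration of the encoding just described, here is a minimal
Python sketch that collects the "CFunc" and "LinkS" attributes from REF
elements. Only the names REF, CFunc and LinkS come from this README; the
sample fragment, its wrapper element and the category values are made up
for the example and are not taken from the corpus. It also assumes the
files parse with a plain XML parser; if they rely on entities declared
in paper-structure.dtd, a DTD-aware parser may be needed instead.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def citation_functions(root):
    """Yield (CFunc, LinkS, citation text) for every REF element under root.

    LinkS is None when the annotator gave no reason for the class.
    """
    for ref in root.iter("REF"):
        text = "".join(ref.itertext()).strip()
        yield ref.get("CFunc"), ref.get("LinkS"), text


# Hypothetical fragment: the DOC wrapper and the values "CategoryA" and
# "CategoryB" are placeholders, not real corpus content.
SAMPLE = (
    '<DOC>We extend <REF CFunc="CategoryA" LinkS="builds on it">Smith 1990</REF> '
    'and differ from <REF CFunc="CategoryB">Jones 1992</REF>.</DOC>'
)

rows = list(citation_functions(ET.fromstring(SAMPLE)))
counts = Counter(cfunc for cfunc, _, _ in rows)

for cfunc, reason, text in rows:
    print(cfunc, reason, text)
```

For a real corpus file, ET.parse(path).getroot() would take the place of
ET.fromstring(SAMPLE), under the same parsing assumption.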
The corpus comes in two parts:
- 2006_paper_training/ -- 116 files used in the above-mentioned
  paper. They are a mix of production-mode annotated files (one
  annotator only), annotation files as given on page 233 of the
  book, and re-annotated development files (cf. p. 233ff).
- additional/ -- shortly after publication of the paper, we annotated
  45 additional files from the cmp-lg archive. If all 161 files are
  used in replication experiments, the resulting numbers will
  therefore not be directly comparable to our published results.
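
For replication work it may help to select the two parts explicitly. The
following Python sketch lists the data files of the chosen parts and
compares the counts with the figures given above. The function names and
the root argument (the directory into which the archive was unpacked)
are assumptions made for this example; only the directory names, the
file extension and the counts 116 and 45 come from this README.

```python
from pathlib import Path

# File counts per part, as stated in this README.
EXPECTED = {"2006_paper_training": 116, "additional": 45}


def corpus_files(root, parts=("2006_paper_training",)):
    """Return the sorted .cfc-scixml paths found in the chosen parts.

    The default selects only the files used in the 2006 paper.
    """
    files = []
    for part in parts:
        files.extend(sorted(Path(root, part).glob("*.cfc-scixml")))
    return files


def check_counts(root):
    """Map each part name to a (found, expected) pair of file counts."""
    return {
        part: (len(corpus_files(root, (part,))), expected)
        for part, expected in EXPECTED.items()
    }
```

Calling corpus_files(root) with the default argument keeps an experiment
on the 116 files of the paper; passing both part names gives all 161.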
You should see the following when you untar:
README [the current file]
2006_paper_training:
  9405001.cfc-scixml
  9405002.cfc-scixml
  9405004.cfc-scixml
  9405010.cfc-scixml
  9405013.cfc-scixml
  9405022.cfc-scixml
  9405023.cfc-scixml
  9405028.cfc-scixml
  9405033.cfc-scixml
  9405035.cfc-scixml
  9407011.cfc-scixml
  9408003.cfc-scixml
  9408004.cfc-scixml
  9408006.cfc-scixml
  9408011.cfc-scixml
  9408014.cfc-scixml
  9409004.cfc-scixml
  9410001.cfc-scixml
  9410005.cfc-scixml
  9410006.cfc-scixml
  9410008.cfc-scixml
  9410009.cfc-scixml
  9410012.cfc-scixml
  9410022.cfc-scixml
  9410032.cfc-scixml
  9410033.cfc-scixml
  9411019.cfc-scixml
  9411021.cfc-scixml
  9411023.cfc-scixml
  9412005.cfc-scixml
  9412008.cfc-scixml
  9502004.cfc-scixml
  9502005.cfc-scixml
  9502006.cfc-scixml
  9502009.cfc-scixml
  9502014.cfc-scixml
  9502015.cfc-scixml
  9502018.cfc-scixml
  9502021.cfc-scixml
  9502022.cfc-scixml
  9502023.cfc-scixml
  9502024.cfc-scixml
  9502031.cfc-scixml
  9502033.cfc-scixml
  9502035.cfc-scixml
  9502037.cfc-scixml
  9502038.cfc-scixml
  9502039.cfc-scixml
  9503002.cfc-scixml
  9503004.cfc-scixml
  9503005.cfc-scixml
  9503007.cfc-scixml
  9503009.cfc-scixml
  9503013.cfc-scixml
  9503014.cfc-scixml
  9503015.cfc-scixml
  9503017.cfc-scixml
  9503018.cfc-scixml
  9503023.cfc-scixml
  9503025.cfc-scixml
  9504002.cfc-scixml
  9504006.cfc-scixml
  9504007.cfc-scixml
  9504017.cfc-scixml
  9504024.cfc-scixml
  9504026.cfc-scixml
  9504027.cfc-scixml
  9504030.cfc-scixml
  9504033.cfc-scixml
  9504034.cfc-scixml
  9505001.cfc-scixml
  9505011.cfc-scixml
  9505024.cfc-scixml
  9506004.cfc-scixml
  9506017.cfc-scixml
  9508005.cfc-scixml
  9511001.cfc-scixml
  9511006.cfc-scixml
  9601004.cfc-scixml
  9604019.cfc-scixml
  9604022.cfc-scixml
  9605013.cfc-scixml
  9605014.cfc-scixml
  9605016.cfc-scixml
  9605023.cfc-scixml
  9606028.cfc-scixml
  9606031.cfc-scixml
  9607001.cfc-scixml
  9607019.cfc-scixml
  9702002.cfc-scixml
  9703002.cfc-scixml
  9704002.cfc-scixml
  9704008.cfc-scixml
  9706013.cfc-scixml
  9707009.cfc-scixml
  9711010.cfc-scixml
  9806001.cfc-scixml
  9806019.cfc-scixml
  9807001.cfc-scixml
  9808008.cfc-scixml
  9808009.cfc-scixml
  9808012.cfc-scixml
  9809027.cfc-scixml
  9809106.cfc-scixml
  9809112.cfc-scixml
  9810015.cfc-scixml
  9811009.cfc-scixml
  9902001.cfc-scixml
  9904008.cfc-scixml
  9905001.cfc-scixml
  9905008.cfc-scixml
  9905009.cfc-scixml
  9906004.cfc-scixml
  9907006.cfc-scixml
  9907007.cfc-scixml
  9907010.cfc-scixml
  paper-structure.dtd
additional:
  0001012.cfc-scixml
  0003055.cfc-scixml
  0003060.cfc-scixml
  0003083.cfc-scixml
  0005006.cfc-scixml
  0005015.cfc-scixml
  0005016.cfc-scixml
  0005025.cfc-scixml
  0006003.cfc-scixml
  0006011.cfc-scixml
  0006019.cfc-scixml
  0006021.cfc-scixml
  0006028.cfc-scixml
  0006038.cfc-scixml
  0006044.cfc-scixml
  0007035.cfc-scixml
  0008004.cfc-scixml
  0008005.cfc-scixml
  0008012.cfc-scixml
  0008016.cfc-scixml
  0008017.cfc-scixml
  0008020.cfc-scixml
  0008021.cfc-scixml
  0008022.cfc-scixml
  0008023.cfc-scixml
  0008024.cfc-scixml
  0008026.cfc-scixml
  0008027.cfc-scixml
  0008028.cfc-scixml
  0008029.cfc-scixml
  0008034.cfc-scixml
  0008035.cfc-scixml
  0009027.cfc-scixml
  0010020.cfc-scixml
  0011001.cfc-scixml
  0011007.cfc-scixml
  0011020.cfc-scixml
  0102019.cfc-scixml
  0102020.cfc-scixml
  9407001.cfc-scixml
  9907003.cfc-scixml
  9907013.cfc-scixml
  9912003.cfc-scixml
  9912004.cfc-scixml
  9912005.cfc-scixml
  paper-structure.dtd
Simone Teufel, August 2014