Citation Function Corpus -- Distribution Data

This is the Citation Function Classification corpus [CFC corpus], created by Simone Teufel, Advaith Siddharthan and Dan Tidhar. It consists of 161 CFC-annotated conference articles in computational linguistics, originally drawn from the Cmplg arXiv.

The corpus is distributed under the Creative Commons Attribution-NonCommercial 2.0 UK: England and Wales Licence (CC BY-NC 2.0 UK).

The citation function classification experiments we performed with the corpus were first published as

The corpus and its creation are described in detail in the book

My coauthors and I would appreciate a courtesy email if and when you download the CFC corpus, and definitely when you publish new research using this corpus.

The corpus is in the Scixml format, created by Simone Teufel and also defined in the above book. A DTD for this corpus, paper-structure.dtd, which includes definitions for citation function annotation, is given in the two data directories. The file extension of the data files is accordingly "cfc-scixml".

Citation function is encoded as one of 12 categories in the XML attribute "CFunc", which sits on REF XML elements (in-line references). Annotators were encouraged to give a reason for the class they assigned; this is encoded in the "LinkS" XML attribute.

The corpus comes in two parts:

You should see the following when you untar:

README [the current file]
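
As a rough illustration of how the annotation described above might be read, the following Python sketch collects the "CFunc" and "LinkS" attributes from REF elements and counts the categories. Only the REF element and the two attribute names are taken from this README. The sample fragment, the PAPER and S element names, the labels "LabelA" and "LabelB", and the function names are hypothetical placeholders, not the real category names or file contents.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical fragment. Only REF, CFunc and LinkS come from the README;
# the other element names, the labels and the text are invented here.
SAMPLE = """<PAPER>
  <S>We build on <REF CFunc="LabelA" LinkS="uses their parser">Smith 1990</REF>.</S>
  <S>Unlike <REF CFunc="LabelB">Jones 1992</REF>, we use no lexicon.</S>
  <S>See also <REF CFunc="LabelA">Lee 1993</REF>.</S>
</PAPER>"""


def citation_functions(root):
    """Return (CFunc, LinkS, reference text) for each REF that has a CFunc."""
    rows = []
    for ref in root.iter("REF"):
        cfunc = ref.get("CFunc")
        if cfunc is None:
            # Skip references without a citation function annotation.
            continue
        text = "".join(ref.itertext()).strip()
        # LinkS is optional, so ref.get returns None when it is absent.
        rows.append((cfunc, ref.get("LinkS"), text))
    return rows


def count_functions(root):
    """Count how often each CFunc value occurs under root."""
    return Counter(cfunc for cfunc, _, _ in citation_functions(root))


root = ET.fromstring(SAMPLE)
rows = citation_functions(root)
counts = count_functions(root)
for label, n in sorted(counts.items()):
    print(label, n)
```

For an actual data file one would presumably replace ET.fromstring(SAMPLE) with ET.parse(path).getroot(), where path points to a "cfc-scixml" file. Whether the files load this way without further handling of paper-structure.dtd has not been checked here.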




Simone Teufel, August 2014