Argumentative Zoning Corpus -- Distribution Data

This is the original Argumentative Zoning corpus [AZ corpus], created and annotated by Simone Teufel and collaborators (Byron Georgantopolous, Marc Moens, Vasilis Karaiskos, Anne Wilson, Donald Tennant) between 1996 and 2004. It consists of 80 AZ-annotated conference articles in computational linguistics, originally drawn from the Cmplg arXiv. The corpus is distributed under the Creative Commons Attribution-NonCommercial 2.0 UK: England and Wales Licence (CC BY-NC 2.0 UK).

The corpus is described in detail in the following book [preferred citation]:

First publication of (a preliminary version of) the AZ corpus was in:

The Argumentative Zoning annotation scheme was first fully published in:

I would appreciate a courtesy email if and when you download the AZ corpus, and definitely when you publish new research using this corpus.

The corpus is in the SciXML format, created by Simone Teufel and also defined in the above book. The data files accordingly have the file extension "az-scixml". The SciXML format was first published in:

In the current corpus, each sentence is annotated by one annotator, according to the AZ annotation scheme. The annotation is recorded in the AZ attribute of S and A-S elements. Details about the AZ annotation procedure are in the book; they were first published in the following paper:

The formal definition of the SciXML corpus format is given in the DTD file paper-structure.dtd, which is included in the distribution and also contains the definitions for AZ annotation.
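As an illustration of how the AZ attribute on S and A-S elements might be read, here is a minimal Python sketch using only the standard library. The embedded sample document is invented for illustration: its nesting (PAPER, ABSTRACT, BODY), its ID attributes, and its sentences are assumptions, not an excerpt from the corpus. Consult paper-structure.dtd for the actual structure.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical miniature document. Only the S and A-S element names and the
# AZ attribute come from the README; everything else is assumed for the demo.
SAMPLE = """<PAPER>
<ABSTRACT>
<A-S ID="A-0" AZ="AIM">We present a method.</A-S>
</ABSTRACT>
<BODY>
<S ID="S-0" AZ="BKG">Prior work exists.</S>
<S ID="S-1" AZ="OWN">We do this.</S>
<S ID="S-2" AZ="OWN">We do that.</S>
</BODY>
</PAPER>"""


def count_az_labels(root):
    """Count AZ labels over all S and A-S elements below root."""
    counts = Counter()
    for elem in root.iter():
        if elem.tag in ("S", "A-S"):
            label = elem.get("AZ")
            if label is not None:  # skip sentences without an AZ attribute
                counts[label] += 1
    return counts


root = ET.fromstring(SAMPLE)
counts = count_az_labels(root)
print(dict(counts))
```

For a file on disk, ET.parse(path).getroot() could replace ET.fromstring(SAMPLE); whether the real files parse without further handling (for example, of entities declared in the DTD) is untested here.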

Papers have metadata associated with them: a manual classification of the type of work, the authors, the title, and where the paper was published. Abstracts are marked; sentences are segmented. Correspondences between sentences in the abstract and the document were semi-manually determined and are marked with the attributes DOCUMENTC and ABSTRACTC.

Citations in running text are marked (REF). Occurrences of author names, without dates, are marked as REFAUTHOR. Attribute SELF marks self-citations. Reference lists are parsed and marked.
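The citation markup described above could be collected along the following lines. This is a sketch under stated assumptions: the sample sentences are invented, and the value of the SELF attribute is assumed to be "YES" for self-citations, which the README does not specify. Check paper-structure.dtd for the actual attribute values.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample. REF, REFAUTHOR and SELF are named in the README;
# the sentences, the BODY wrapper and the SELF="YES" value are assumptions.
SAMPLE = """<BODY>
<S AZ="OTH"><REFAUTHOR>Smith</REFAUTHOR> proposed X <REF>Smith 1990</REF>.</S>
<S AZ="BAS">We build on <REF SELF="YES">Doe 1995</REF>.</S>
</BODY>"""


def collect_citations(root):
    """Return (tag, text, is_self) for each REF and REFAUTHOR, in document order."""
    found = []
    for elem in root.iter():
        if elem.tag in ("REF", "REFAUTHOR"):
            # Assumption: a self-citation carries SELF="YES".
            is_self = elem.get("SELF", "").upper() == "YES"
            found.append((elem.tag, (elem.text or "").strip(), is_self))
    return found


citations = collect_citations(ET.fromstring(SAMPLE))
for tag, text, is_self in citations:
    print(tag, text, "self" if is_self else "")
```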

Lists of sub-sentences are marked up with the sentence attribute 'TYPE=ITEM'.

When you untar, you should expect to see the following files:


Simone Teufel, August 2014