This is the original Argumentative Zoning corpus [AZ corpus], created
and annotated by Simone Teufel and collaborators (Byron
Georgantopolous, Marc Moens, Vasilis Karaiskos, Anne Wilson, Donald
Tennant) between 1996 and 2004. It consists of 80 AZ-annotated conference
articles in computational linguistics, originally drawn from the Cmplg
arXiv. The corpus is distributed under the Creative Commons
Attribution-NonCommercial 2.0 UK:England and Wales Licence (CC BY-NC
2.0 UK).
The corpus is described in detail in the following book [preferred
citation]:
- S.Teufel, The structure of scientific articles: Applications to
Indexing and Summarization, CSLI Publications, 2010.
First publication of (a preliminary version of) the AZ corpus was in:
- S.Teufel, M.Moens, Sentence extraction as a classification task,
Proceedings of the ACL/EACL-97 Workshop on Intelligent Scalable
Summarization, Madrid, Spain, 1997.
The argumentative zoning annotation scheme was first fully published in:
- S.Teufel, M.Moens, Summarising Scientific Articles: Experiments with
Relevance and Rhetorical Status, Computational Linguistics 28(4),
pages 409-446, 2002.
I would appreciate a courtesy email if and when you download the AZ
corpus, and definitely when you publish new research using this
corpus.
The corpus is in the SciXML format created by Simone Teufel and also
defined in the above book.
The SciXML format was first published in:
- S.Teufel, N.Elhadad, Collection and linguistic processing of a large-scale
corpus of medical articles, Proceedings of the 3rd LREC, pages
1214-1219. 2002.
A DTD for this corpus, paper-structure.dtd, which includes definitions
for AZ annotation, is also given. The data files accordingly have the
file extension "az-scixml".
In the current corpus, each sentence is annotated by one annotator,
according to the AZ annotation scheme. This is marked in the AZ
attribute of S and A-S elements. Details about the AZ annotation
procedure are given in the book; they were first published in the
following paper:
- S.Teufel, J.Carletta, M.Moens, An annotation scheme for discourse-level
argumentation in research articles, Proceedings of the Ninth EACL,
pages 110-117, 1999.
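As a rough illustration of reading the AZ attribute described above, the
following Python sketch counts AZ values over S and A-S elements. It
runs on a made-up miniature document: only the element names S and A-S
and the attribute name AZ come from this README; the wrapper elements
(PAPER, ABSTRACT, BODY) and the label values are placeholders, and it is
an assumption that the corpus files parse with the standard library
parser.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical miniature document. Only S, A-S and AZ are taken from the
# README; the wrapper elements and label values are placeholders.
SAMPLE = """<PAPER>
<ABSTRACT>
<A-S AZ="LABEL_A">An abstract sentence.</A-S>
</ABSTRACT>
<BODY>
<S AZ="LABEL_A">A body sentence.</S>
<S AZ="LABEL_B">Another body sentence.</S>
<S>A sentence without an AZ attribute.</S>
</BODY>
</PAPER>"""


def count_az_labels(root):
    """Count the AZ attribute values on all S and A-S elements under root."""
    counts = Counter()
    for elem in root.iter():
        if elem.tag in ("S", "A-S") and "AZ" in elem.attrib:
            counts[elem.attrib["AZ"]] += 1
    return counts


counts = count_az_labels(ET.fromstring(SAMPLE))
for label, n in sorted(counts.items()):
    print(label, n)
```

For a real data file, the root passed to count_az_labels would
presumably come from ET.parse("9405001.az-scixml").getroot() instead of
the sample string.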
The formal XML schema for the SciXML corpus format created by Simone
Teufel is given in the DTD file paper-structure.dtd.
Papers have metadata associated with them: a manual classification of
the type of work, the authors, the title, and where the paper was
published. Abstracts are marked; sentences are
segmented. Correspondences between sentences in the abstract and the
document were semi-manually determined and are marked with the
attributes DOCUMENTC and ABSTRACTC.
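The correspondence attributes can be inspected without interpreting
their values. The sketch below simply lists every element that carries
DOCUMENTC or ABSTRACTC. The sample document is invented: which element
carries which attribute, the attribute values, and the wrapper elements
are all assumptions made for illustration, so check paper-structure.dtd
for the actual definitions.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample. The attribute names DOCUMENTC and ABSTRACTC come
# from the README; their placement and values here are placeholders.
SAMPLE = """<PAPER>
<ABSTRACT>
<A-S DOCUMENTC="placeholder-1">An abstract sentence.</A-S>
<A-S>An abstract sentence with no marked correspondence.</A-S>
</ABSTRACT>
<BODY>
<S ABSTRACTC="placeholder-2">A document sentence.</S>
<S>Another document sentence.</S>
</BODY>
</PAPER>"""


def correspondence_marks(root):
    """Return (tag, attribute name, attribute value, text) for marked elements."""
    marks = []
    for elem in root.iter():
        for name in ("DOCUMENTC", "ABSTRACTC"):
            if name in elem.attrib:
                text = "".join(elem.itertext()).strip()
                marks.append((elem.tag, name, elem.attrib[name], text))
    return marks


marks = correspondence_marks(ET.fromstring(SAMPLE))
for mark in marks:
    print(mark)
```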
Citations in running text are marked (REF). Occurrences of author
names, without dates, are marked as REFAUTHOR. Attribute SELF marks
self-citations. Reference lists are parsed and marked.
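A possible way to pull out the citation markup, again as a sketch on an
invented sentence: it collects the text of REF and REFAUTHOR elements
and treats a REF as a self-citation whenever a SELF attribute is
present. That SELF sits on REF, and the value "YES" used in the sample,
are assumptions; the code only tests for the attribute's presence.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample sentence. REF, REFAUTHOR and SELF are named in the
# README; the SELF value and the citation strings are placeholders.
SAMPLE = """<PAPER>
<S>As shown by <REF SELF="YES">Author 1990</REF>, and later by
<REF>Other 1991</REF>, the approach of <REFAUTHOR>Other</REFAUTHOR> works.</S>
</PAPER>"""


def collect_citations(root):
    """Return (all REF texts, self-citing REF texts, REFAUTHOR texts)."""
    refs, self_refs, authors = [], [], []
    for elem in root.iter():
        text = "".join(elem.itertext()).strip()
        if elem.tag == "REF":
            refs.append(text)
            if "SELF" in elem.attrib:  # presence only; value not inspected
                self_refs.append(text)
        elif elem.tag == "REFAUTHOR":
            authors.append(text)
    return refs, self_refs, authors


refs, self_refs, authors = collect_citations(ET.fromstring(SAMPLE))
print(refs, self_refs, authors)
```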
Lists of sub-sentences are marked up by the sentence feature
'TYPE=ITEM'.
After untarring the archive, you should see the following files:
- 00README
- 9405001.az-scixml
- 9405002.az-scixml
- 9405004.az-scixml
- 9405010.az-scixml
- 9405013.az-scixml
- 9405022.az-scixml
- 9405023.az-scixml
- 9405028.az-scixml
- 9405033.az-scixml
- 9405035.az-scixml
- 9407011.az-scixml
- 9408003.az-scixml
- 9408004.az-scixml
- 9408006.az-scixml
- 9408011.az-scixml
- 9408014.az-scixml
- 9409004.az-scixml
- 9410001.az-scixml
- 9410005.az-scixml
- 9410006.az-scixml
- 9410008.az-scixml
- 9410009.az-scixml
- 9410012.az-scixml
- 9410022.az-scixml
- 9410032.az-scixml
- 9410033.az-scixml
- 9411019.az-scixml
- 9411021.az-scixml
- 9411023.az-scixml
- 9412005.az-scixml
- 9412008.az-scixml
- 9502004.az-scixml
- 9502005.az-scixml
- 9502006.az-scixml
- 9502009.az-scixml
- 9502014.az-scixml
- 9502015.az-scixml
- 9502018.az-scixml
- 9502021.az-scixml
- 9502022.az-scixml
- 9502023.az-scixml
- 9502024.az-scixml
- 9502031.az-scixml
- 9502033.az-scixml
- 9502035.az-scixml
- 9502037.az-scixml
- 9502038.az-scixml
- 9502039.az-scixml
- 9503002.az-scixml
- 9503004.az-scixml
- 9503005.az-scixml
- 9503007.az-scixml
- 9503009.az-scixml
- 9503013.az-scixml
- 9503014.az-scixml
- 9503015.az-scixml
- 9503017.az-scixml
- 9503018.az-scixml
- 9503023.az-scixml
- 9503025.az-scixml
- 9504002.az-scixml
- 9504006.az-scixml
- 9504007.az-scixml
- 9504017.az-scixml
- 9504024.az-scixml
- 9504026.az-scixml
- 9504027.az-scixml
- 9504030.az-scixml
- 9504033.az-scixml
- 9504034.az-scixml
- 9505001.az-scixml
- 9506004.az-scixml
- 9511001.az-scixml
- 9511006.az-scixml
- 9601004.az-scixml
- 9604019.az-scixml
- 9604022.az-scixml
- 9605013.az-scixml
- 9605014.az-scixml
- 9605016.az-scixml
- paper-structure.dtd
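To check an unpacked copy against the list above, something like the
following Python sketch may help. It counts the files ending in
".az-scixml" (80 are listed above) and reports whether 00README and
paper-structure.dtd are present. The function name check_corpus_dir is
made up for this example, and the demonstration runs on a temporary
directory with empty stand-in files rather than on the real corpus.

```python
import tempfile
from pathlib import Path

EXPECTED_DATA_FILES = 80  # number of .az-scixml files in the list above
EXTRA_FILES = ("00README", "paper-structure.dtd")


def check_corpus_dir(directory):
    """Return (number of .az-scixml files, list of missing extra files)."""
    directory = Path(directory)
    data_files = sorted(p.name for p in directory.glob("*.az-scixml"))
    missing_extras = [n for n in EXTRA_FILES if not (directory / n).is_file()]
    return len(data_files), missing_extras


# Demonstration on a throwaway directory with empty stand-in files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("9405001.az-scixml", "9405002.az-scixml", "00README"):
        (Path(tmp) / name).touch()
    n_data, missing = check_corpus_dir(tmp)

print("data files:", n_data, "of", EXPECTED_DATA_FILES)
print("missing:", missing)
```

On a complete copy, the first value returned should equal
EXPECTED_DATA_FILES and the list of missing files should be empty.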
Simone Teufel, August 2014