Anaphora resolution input format

I have used the same input format, a kind of XML, for all the anaphora resolution algorithms. The tags used are segment, s, parse, and relations.
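
As a rough sanity check before running any of the algorithms, the four tag names above can be counted in an input file. The sketch below is only an illustration and is not part of this implementation. It assumes the file is well-formed XML with a single root element, and the function name count_format_tags and the sample string are hypothetical. It makes no assumption about how the tags nest or which attributes they carry.

```python
import xml.etree.ElementTree as ET

# Tag names taken from the description above.
EXPECTED_TAGS = ("segment", "s", "parse", "relations")


def count_format_tags(xml_text):
    """Return a dict mapping each expected tag name to its number of occurrences.

    Assumes xml_text is well-formed XML with a single root element
    (an assumption; real input files may need wrapping first).
    """
    root = ET.fromstring(xml_text)
    counts = {tag: 0 for tag in EXPECTED_TAGS}
    for element in root.iter():  # iter() visits the root and all descendants
        if element.tag in counts:
            counts[element.tag] += 1
    return counts


# Hypothetical sample; the nesting and contents are placeholders only.
sample = "<segment><s>placeholder</s><parse/><relations/></segment>"
print(count_format_tags(sample))
```

A count of zero for any of the four names suggests the file is missing a section that the algorithms expect.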

Segments

Segments are only used by the Ge and Charniak algorithm. For the other algorithms, one big segment should be placed around the whole document.
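
For the algorithms that do not use segments, placing one big segment around the whole document could look something like the sketch below. It assumes the segment tag is written as a plain <segment> element with no attributes, which may not match the real input files, and the function name wrap_in_single_segment is hypothetical.

```python
def wrap_in_single_segment(document_body):
    """Wrap an entire document body in one segment element.

    Assumes the segment tag takes no attributes (an assumption,
    not something confirmed by the format description).
    """
    # Trim surrounding blank lines so the wrapper sits directly
    # around the first and last line of the document.
    return "<segment>\n" + document_body.strip("\n") + "\n</segment>\n"


# Hypothetical body; real sentences would carry tags and numbers.
print(wrap_in_single_segment("<s>placeholder sentence</s>\n"))
```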

Sentence

The sentences are expected to be tagged (using the CLAWS-II tagset) and numbered. The number of parses available is expected at the end of each sentence. This output should be easy to obtain from the RASP parser.

Parse

The Lappin and Leass algorithm and the Kennedy and Boguraev algorithm don't actually need a full parse (this implementation only uses grammatical relations, or GRs), so the full parse section can be replaced with a fragment (talk to me before you do this). This may be necessary if you're using an algorithm which only produces grammatical relations.

Relations

The relations tags need to be present, although the relations themselves don't have to be filled in. If a relations tag is filled in, it should relate noun phrase heads together with sentence numbers and sentence positions, as illustrated in sentence 3 of the input file. If the relations are filled in, it is very simple to evaluate the performance of the anaphora resolution algorithms. It may be useful to evaluate the performance of the algorithm on your domain before undertaking further work.