Anaphora resolution input format
I have used the same XML-like input format for all the anaphora
resolution algorithms. The tags used are segment, s, parse, and
relations.
Segments
Segments are only used by the Ge and Charniak algorithm. For the other
algorithms, place one big segment around the whole document.
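As a rough illustration of the single-segment case, the sketch below wraps a list of sentence elements in one segment element using Python's standard xml.etree.ElementTree module. The function name wrap_in_single_segment and the empty s elements are hypothetical placeholders; the real sentence content and any attributes should be taken from the input file, not from this sketch.

```python
import xml.etree.ElementTree as ET


def wrap_in_single_segment(children):
    """Return one <segment> element containing the given elements in order."""
    segment = ET.Element("segment")
    segment.extend(children)
    return segment


# Hypothetical, simplified sentence elements. The real <s> content
# (tagged words, parse count, parse, relations) is not reproduced here.
sentences = [ET.Element("s"), ET.Element("s")]

segment = wrap_in_single_segment(sentences)
print(ET.tostring(segment, encoding="unicode"))
```

For the Ge and Charniak algorithm the same helper could be called once per segment instead of once per document.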
Sentence
The sentences are expected to be tagged (using the CLAWS-II tagset)
and numbered. The number of parses available is expected at the end of
each sentence. This output should be easy to obtain from the RASP
parser.
Parse
The Lappin and Leass and the Kennedy and Boguraev algorithms don't
actually need a full parse (this implementation only uses grammatical
relations, or GRs), so the full parse section can be replaced with a
fragment (talk to me before you do this). This may be necessary if
you're using a parser which only produces grammatical relations.
Relations
The relations tags need to be present (although the relations don't
have to be filled in). If a relation tag is filled in, it should
relate noun phrase heads together with sentence numbers and sentence
positions as illustrated in sentence 3 in the input file. It is very
simple to evaluate the performance of the anaphora resolution
algorithms if the relations are filled in. It may be useful to
evaluate the performance of the algorithm on your domain before
undertaking further work.
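One simple way to score an algorithm against filled-in relations is sketched below. It assumes, purely for illustration, that each noun phrase head is identified by a (sentence number, sentence position) pair and that both the gold relations and the algorithm's output have been read into dictionaries mapping an anaphor to its antecedent. The function name resolution_accuracy, the dictionary layout, and the choice of plain accuracy as the measure are all assumptions, not part of the implementation described here.

```python
def resolution_accuracy(gold, predicted):
    """Fraction of gold anaphors whose predicted antecedent matches the gold one.

    Both arguments map an anaphor, given as (sentence number, position),
    to its antecedent in the same form. This layout is hypothetical.
    """
    if not gold:
        raise ValueError("no gold relations to evaluate against")
    correct = sum(
        1
        for anaphor, antecedent in gold.items()
        if predicted.get(anaphor) == antecedent
    )
    return correct / len(gold)


# Made-up example: two anaphors in sentence 3, one resolved correctly.
gold = {(3, 1): (2, 4), (3, 6): (1, 2)}
predicted = {(3, 1): (2, 4), (3, 6): (2, 4)}

print(resolution_accuracy(gold, predicted))  # 0.5
```

Anaphors that the algorithm leaves unresolved count as errors in this sketch, since predicted.get returns None for them.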