PhD -- short description
PhD Thesis
-
Topic: Rhetorical text analysis of scientific articles for
summarization and digital library applications.
-
Data: A 520,000-word corpus of academic papers (203 papers),
marked up for internal text structure (headers, title, paragraphs). Most
papers are between 6 and 10 pages long. The corpus was drawn from the cmp_lg
archive of computational linguistics papers and was collected
jointly with Byron Georgantopoulos. It is manually annotated
with rhetorical information about the status of representative sentences.
-
Research: Combination of various well-known heuristics by means of
statistical techniques (Naive Bayes, Maximum Entropy, RIPPER). The focus
of all these heuristics is on rhetorical analysis: what argumentative
status does a given sentence have with respect to the overall
rhetorical structure of the paper? I treat the argumentative structure of
papers as an instance of the conceptual problem-solution space of the
reported research.
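To make the idea of combining heuristics statistically more concrete, here is a minimal Naive Bayes sketch in Python. It is an illustration only, not the thesis implementation: the feature names (cue_phrase, position, title_overlap), the labels (PROBLEM, SOLUTION, OTHER) and the toy training data are all hypothetical. It also assumes that every sentence carries a value for every feature.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(examples, alpha=1.0):
    """Count label and feature-value frequencies.

    examples: list of (features, label) pairs, where features is a dict
    mapping a heuristic name to a discrete value.
    alpha: add-alpha (Laplace) smoothing constant.
    """
    label_counts = Counter(label for _, label in examples)
    value_counts = defaultdict(Counter)  # (label, feature) -> value counts
    feature_values = defaultdict(set)    # feature -> values seen in training
    for features, label in examples:
        for name, value in features.items():
            value_counts[(label, name)][value] += 1
            feature_values[name].add(value)
    return label_counts, value_counts, feature_values, alpha


def classify(model, features):
    """Return the label with the highest smoothed log posterior."""
    label_counts, value_counts, feature_values, alpha = model
    total = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label, count in label_counts.items():
        score = math.log(count / total)  # log prior
        for name, value in features.items():
            n_values = len(feature_values[name])
            numer = value_counts[(label, name)][value] + alpha
            denom = count + alpha * n_values
            score += math.log(numer / denom)  # log likelihood of one heuristic
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data: each sentence is described by three heuristics.
TRAIN = [
    ({"cue_phrase": True, "position": "start", "title_overlap": True}, "PROBLEM"),
    ({"cue_phrase": True, "position": "start", "title_overlap": False}, "PROBLEM"),
    ({"cue_phrase": True, "position": "end", "title_overlap": True}, "SOLUTION"),
    ({"cue_phrase": False, "position": "end", "title_overlap": True}, "SOLUTION"),
    ({"cue_phrase": False, "position": "middle", "title_overlap": False}, "OTHER"),
    ({"cue_phrase": False, "position": "middle", "title_overlap": False}, "OTHER"),
]

model = train_naive_bayes(TRAIN)
sentence = {"cue_phrase": True, "position": "start", "title_overlap": False}
print(classify(model, sentence))
```

Each heuristic contributes one independent term to the score, which is what lets individually weak heuristics be pooled into a single decision about a sentence.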
-
Annotation work: Do others share my intuitions about this
perspective on research papers as problem-solving reports? I am trying to
find out through inter-annotator studies. I am currently writing guidelines for
the task...
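One common way to quantify whether two annotators share the same intuitions is a chance-corrected agreement score. The sketch below computes Cohen's kappa for two annotators. Note that the choice of kappa is an assumption made for illustration, since the text above does not say which agreement measure is used, and the labels in the example are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators (Cohen's kappa).

    labels_a, labels_b: equal-length sequences with one label per sentence.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    # Proportion of sentences on which the annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical judgements of two annotators on four sentences.
annotator_1 = ["PROBLEM", "PROBLEM", "SOLUTION", "SOLUTION"]
annotator_2 = ["PROBLEM", "PROBLEM", "SOLUTION", "PROBLEM"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 2))
```

A value near 1 indicates agreement well beyond chance, while a value near 0 indicates agreement no better than chance.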
-
Evaluation: Evaluation is crucial in summarization, even more
crucial than in other text-generating tasks, because we need to
establish the function of the abstract with respect to a given
task. Even the most coherent, well-written, human-generated
abstract might not be of use in a given task. As a result,
research is needed into formal definitions of tasks, content
units and functions of abstracts. I believe that the
problem-solution approach is a first step in that direction.
I am currently thinking about task-based evaluation with human subjects,
e.g. a relevance decision task.