PhD -- short description

PhD Thesis

Topic: Rhetorical text analysis of scientific articles for summarization and digital library applications.
Data: 520,000-word corpus of academic papers (203 papers), marked up for internal text structure (headers, title, paragraphs). Most papers are 6--10 pages long. The corpus was drawn from the cmp_lg archive of computational linguistics papers and was collected jointly with Byron Georgantopoulos. It is manually annotated with rhetorical information about the status of representative sentences.
Research: Combination of various well-known heuristics using statistical techniques (Naive Bayes, Maximum Entropy, RIPPER). The focus of all these heuristics is on rhetorical analysis: what argumentative status does a given sentence have with respect to the overall rhetorical structure of the paper? I treat the argumentative structure of a paper as an instance of the conceptual problem-solution space of the reported research.
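To illustrate how heuristics can be combined statistically, here is a minimal Naive Bayes sketch. It is not the thesis implementation: the heuristic names (cue_phrase, early_position, has_citation), the labels (AIM, BACKGROUND) and the toy training data are all hypothetical, chosen only to show the mechanics. Each heuristic is treated as a binary feature, and add-one smoothing is assumed.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(examples, alpha=1.0):
    """Estimate a Bernoulli Naive Bayes model.

    examples: list of (features, label), where features maps a
    heuristic name to True/False. alpha is the smoothing constant.
    """
    label_counts = Counter(label for _, label in examples)
    feature_names = sorted({name for feats, _ in examples for name in feats})

    # fired[label][name] = number of sentences with this label
    # for which the heuristic `name` fired.
    fired = defaultdict(Counter)
    for feats, label in examples:
        for name in feature_names:
            if feats.get(name, False):
                fired[label][name] += 1

    model = {"features": feature_names, "log_prior": {},
             "log_true": {}, "log_false": {}}
    total = len(examples)
    for label, n in label_counts.items():
        model["log_prior"][label] = math.log(n / total)
        for name in feature_names:
            # Smoothed estimate of P(heuristic fires | label).
            p = (fired[label][name] + alpha) / (n + 2 * alpha)
            model["log_true"][(label, name)] = math.log(p)
            model["log_false"][(label, name)] = math.log(1.0 - p)
    return model


def classify(model, feats):
    """Return the label with the highest posterior log-score."""
    scores = {}
    for label, log_prior in model["log_prior"].items():
        score = log_prior
        for name in model["features"]:
            key = (label, name)
            if feats.get(name, False):
                score += model["log_true"][key]
            else:
                score += model["log_false"][key]
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical toy data: heuristic outputs per sentence plus a manual label.
TRAIN = [
    ({"cue_phrase": True, "early_position": True, "has_citation": False}, "AIM"),
    ({"cue_phrase": True, "early_position": False, "has_citation": False}, "AIM"),
    ({"cue_phrase": False, "early_position": True, "has_citation": True}, "BACKGROUND"),
    ({"cue_phrase": False, "early_position": False, "has_citation": True}, "BACKGROUND"),
]

MODEL = train_naive_bayes(TRAIN)
print(classify(MODEL, {"cue_phrase": True, "early_position": True, "has_citation": False}))
```

The independence assumption lets each heuristic contribute its own evidence to the score, so a weak heuristic can be added without retuning the others.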
Annotation work: Do others share my intuition that research papers can be read as problem-solving reports? I am trying to find out through inter-annotator agreement studies. Currently writing guidelines for the task...
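One common way to quantify agreement between two annotators is Cohen's kappa, which corrects raw agreement for agreement expected by chance. The text above does not say which agreement measure is used, so the choice of kappa is an assumption, and the labels in this sketch are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same sentences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)

    # Observed agreement: fraction of sentences given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: probability that both annotators pick the same
    # label if each labels at random according to their own distribution.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)

    if expected == 1.0:
        # Both annotators used one identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical annotations of four sentences by two annotators.
annotator_1 = ["AIM", "AIM", "BACKGROUND", "BACKGROUND"]
annotator_2 = ["AIM", "BACKGROUND", "BACKGROUND", "BACKGROUND"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(kappa)
```

Here the annotators agree on three of four sentences (0.75), but half of that agreement is expected by chance, so kappa is 0.5.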
Evaluation: Evaluation is crucial in summarization, even more so than in other text-generating tasks, because we need to establish the function of the abstract with respect to a given task. It might be the case that not even the most coherent, well-written, human-generated abstract will be of use in a given task. As a result, research is needed into formal definitions of tasks, content units and functions of abstracts. I believe that the problem-solution approach is a first step in that direction. I am currently thinking about task-based evaluation with human subjects, e.g. a relevance decision task.