# Computer Laboratory

Old ACS project suggestions

# Project suggestions from the Natural Language and Information Processing Group for 2010-2011

Note that these projects are aimed at those who might like to continue to undertake research into natural language processing for their Ph.D.

## GR-based Improved Parse Selection

Proposer: Ted Briscoe
Supervisor: Ted Briscoe
Special Resources: None

#### Description

The RASP parser produces ranked directed, connected graphs of bilexical head-dependent grammatical relations (GRs) as output; e.g. Kim badly wants to win:

• ncsubj(want Kim _)
• xcomp(to want win)
• ncsubj(win Kim)

(see Briscoe, Andersen et al for more details and examples). GR graphs are statistically ranked using an unlexicalized structural model, so the ranking of PP attachment, noun compounds, etc. can be incorrect. However, the parser is also able to output weighted sets of GRs from the best n derivations, which mostly contain the right analyses to select from. Research has shown that lexical cooccurrence information in GRs can improve parse selection accuracy, but it tends to be domain-sensitive, so a method for acquiring such information from raw text rather than treebanks would be useful.

The project will investigate unsupervised methods for improving parse selection accuracy based on GR output from the unlexicalized parser (e.g. van Noord, Watson et al.). The BNC has been annotated with GRs automatically using unlexicalized RASP (Andersen et al), and the WSJ DepBank/GR test data can be used for evaluation (Briscoe and Carroll, 2006), so the basic task is to implement and evaluate a GR reranking scheme using self-training or confidence-based estimates of the probabilities of specific GRs. If time allows, the project might also investigate smoothing GR probability estimates to handle unseen GRs using estimates of lexical similarity.
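As a rough illustration of the reranking idea, the sketch below estimates GR probabilities from weighted GRs and reorders an n-best list by the mean log-probability of each candidate's GRs. It is only a minimal sketch under stated assumptions: the tuple encoding of GRs, the function names `estimate_gr_probs` and `rerank`, and the `floor` constant for unseen GRs are all hypothetical, and this is not the RASP output format or API.

```python
import math
from collections import Counter


def estimate_gr_probs(weighted_grs):
    """Turn (gr, weight) pairs harvested from automatically parsed text
    into relative-frequency estimates P(gr)."""
    totals = Counter()
    for gr, weight in weighted_grs:
        totals[gr] += weight
    norm = sum(totals.values())
    return {gr: w / norm for gr, w in totals.items()}


def rerank(nbest, gr_probs, floor=1e-6):
    """Order candidate GR sets by mean log-probability, best first.

    `floor` is an assumed stand-in probability for unseen GRs; the mean
    (rather than the sum) avoids favouring candidates with fewer GRs.
    """
    def score(grs):
        if not grs:
            return float("-inf")
        return sum(math.log(gr_probs.get(gr, floor)) for gr in grs) / len(grs)

    return sorted(nbest, key=score, reverse=True)


# Hypothetical toy data: two competing attachments for the same phrase.
harvested = [
    (("dobj", "eat", "pizza"), 2.0),
    (("ncmod", "pizza", "fork"), 0.5),
    (("iobj", "eat", "fork"), 1.5),
]
candidates = [
    [("dobj", "eat", "pizza"), ("ncmod", "pizza", "fork")],
    [("dobj", "eat", "pizza"), ("iobj", "eat", "fork")],
]
best = rerank(candidates, estimate_gr_probs(harvested))[0]
```

Smoothing with lexical similarity, as suggested above, would replace the constant `floor` with an estimate borrowed from similar head or dependent words.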

## Named Entity Recognition and Parsing

Proposer: Ted Briscoe
Supervisor: Ted Briscoe
Special Resources: None

#### Description

The RASP parser brackets NPs and semantically classifies some of them (on the basis of CLAWS tag distinctions and internal structure) into names (places, people, organisations), numbers (including ranges, dates, etc), measure phrases (ounce, year, etc), temporal expressions (days, weeks, months), directions (north, south, etc), partitives (sort of, etc), pronouns, and so forth (see Briscoe for details). However, most remain semantically underspecified as 'normal' because they contain a standard mass or count noun as head (e.g. the company / man are both 'normal', though they are often further classified as named entities (NEs) such as ORGanisation, PERson, etc.).

NE recognition (NER) has been the subject of a series of competitions with associated datasets and evaluation software; see Wikipedia for a summary. Integrating NER and parsing would be beneficial for at least the following reasons. Firstly, it should be possible to reduce or remove the requirement for training data annotated with NE classes and boundaries by exploiting the CLAWS tags and NP bracketing (Ritchie). Secondly, many NEs contain internal structure and compositional semantics (Bank of England is an ORG containing a LOCation), and/or are encoded elliptically inside coordinate constructions (the Banks of England and France denotes two ORGs, 'Bank of England' and 'Bank of France'), and/or may contain intervening material (the Interleukin II (IL-II) promoter is a PROTein 'Interleukin-II promoter' with interleaved acronym), so they are better represented and recovered from grammatical relations or compositional semantic structures (Mazur and Dale). Finally, better NER integrated with parsing should improve performance on both tasks by mutually constraining output from each (Finkel and Manning).

The project will develop an approach to NER integrated with parsing which achieves some of these benefits without the need for full supervision or a joint model. One approach could be to develop a classifier of NP head words and a sequential model of how semantically-classified heads combine in complex NPs, and use this to rerank the top n parses output by RASP.

## Topic Models for Location-Based Social Network Feeds

Proposer: Stephen Clark
Supervisor: Stephen Clark, Cecilia Mascolo, Anastasios Noulas
Special Resources: None

#### Description

We have access to a dataset describing users' communication content, together with the geographical position of the locations where the content was generated. In addition, there is information about each user's social acquaintances.

The focus of the project will be processing and analysing the text describing the user communication, making use of the location-based information as part of the analysis. For example, one possibility would be to build a topic model analysing the semantic topics contained in an individual's communication history, based on standard topic models from the language processing literature, but also include variables in the model which represent the location of the user. The hypothesis would be that the location of a user affects the topic of the user's discussions. In addition to the analysis of individual users, another aspect of the project could consider identifying the topics discussed at given places or areas of a city. That would require the analysis of text generated by multiple users at proximate locations.

Remarks: This is a good opportunity for a student to work with an exciting, cutting-edge social networks dataset, and to combine natural language processing techniques with social network analysis. This is a collaborative project between the Natural Language and Networks research groups.

## Biomedical NLP: Parsing and Event Coreference Resolution

Proposer: Stephen Clark
Supervisor: Stephen Clark, Maria Liakata (EBI)
Special Resources: None

#### Description

In recent years there has been growing interest in the use of natural language processing techniques to extract information from biomedical documents. This area of research, BioNLP or Bio-text mining, is driven by the advances in the Life Sciences and the plethora of articles generated, which make human curation of biomedical resources almost impossible. The challenges in BioNLP are multi-fold, stemming from the very nature of biomedical texts: even low level tasks such as tokenization, sentence boundary detection and named entity recognition differ significantly and require new solutions. At a higher level, semantic representations, discourse, context and domain knowledge play an especially important role in information extraction from biomedical papers. While the challenges are numerous, there is also a variety of resources at hand including knowledge bases and ontologies, corpora and a wide selection of open access papers. The rewards from working in BioNLP match the difficulties as knowledge discovery from biomedical text has the potential of leading to breakthroughs in the Life Sciences.

A lot of work in BioNLP so far has been conducted on paper abstracts rather than full papers, which are arguably more useful and knowledge rich but harder to process. We propose two projects in the context of BioNLP, which address challenges in full biomedical papers.

### BioNLP Project 1: Extrinsic parser evaluation on full biomedical text

There is not much one can achieve in NLP without parsing. In BioNLP, parsers have been evaluated on biomedical abstracts, both on grammatical relations and extrinsically for event extraction. However, structure and content differ significantly between abstracts and full text [1], and evaluation on full papers is much needed. The proposed project would compare three main parsers adapted to BioNLP (C&C, Enju, McClosky) on the recognition of multi-word bio-entities in the CALBC silver corpus. The parsers could also be compared on a classification task which employs grammatical relations as features to recognise sentence-level core scientific concepts in a corpus of scientific papers [[4] & work in progress].

### BioNLP Project 2: Identifying co-referring events in full papers

Extracting bio-events only at the sentence level without taking into account information in the wider discourse is of rather limited use [3]. This is an even more prominent issue in full papers than it is in abstracts. Recent work [5] has looked at employing coreference information and salience in discourse [4] to extract event-argument relations. The proposed project will look at similar methods for identifying coreferring sentences of the same category (e.g. coreferring results or conclusions) in the context of the CoreSC corpus of full papers [[4] & work in progress]. The latter has been annotated at the sentence level with 11 different categories representing core scientific concepts and also contains annotations for co-reference between sentences of the same type.

References
[1] Cohen K. B., Johnson H. L., Verspoor K., Roeder C., and Hunter L.E. The structural and content aspects of abstracts versus bodies of full text journal articles are different. BMC Bioinformatics, 11(492), 2010.
[2] Andrew Clegg and Adrian Shepherd. Benchmarking natural-language parsers for biological applications using dependency graphs. BMC Bioinformatics, 8(1):24, 2007.
[3] A. de Waard, S. Buckingham Shum, A. Carusi, J. Park, M. Samwald, and Á. Sándor. Hypotheses, evidence and relationships: The HypER approach for representing scientific knowledge claims. In Proceedings 8th International Semantic Web Conference, Workshop on Semantic Web Applications in Scientific Discourse, Lecture Notes in Computer Science, Washington DC, 2009. Springer Verlag, Berlin.
[4] M. Liakata, S. Teufel, A. Siddharthan, and C. Batchelor. Corpora for the conceptualisation and zoning of scientific papers. In Proceedings of the 7th International Conference on Language Resources and Evaluation, Valletta, Malta, 2010.
[5] Katsumasa Yoshikawa, Sebastian Riedel, Tsutomu Hirao, Masayuki Asahara, and Yuji Matsumoto. Coreference-based event-argument relation extraction on biomedical text. In Symposium for Semantic Mining in Biomedicine (SMBM), Cambridge, U.K., 2010.

Remarks: This is a collaboration between the NLIP group and the European Bioinformatics Institute (EBI).

## Using Parsers to Construct and Query Large-Scale Entity Relationship Graphs

Proposer: Stephen Clark
Supervisor: Stephen Clark, Gjergji Kasneci (MSR)
Special Resources: None

#### Description

From the MSR website (http://research.microsoft.com/en-us/groups/osa/krr.aspx):

"Imagine a knowledge discovery task that aims at retrieving commonalities or broad relations between two, three or more entities of interest. An example could be the query that asks for the relation between Niels Bohr, Richard Feynman, and Enrico Fermi. Possible answers are that all of them were quantum physicists, theoretical physicists, members of the Manhattan Project, etc. State-of-the-art search engines would only return relevant results to such a query if the given entities and their relations were mentioned on the same Web sites. However, in general, the relevant pieces of information could be distributed across several Web pages and consequently, the standard page-oriented keyword-search paradigm is not sufficient to deal with such tasks. Hence, our focus is on a more general approach to access the knowledge on the Web."

We have access to a large knowledge base of facts extracted from the web, which can be used to facilitate complex queries relating entities. The purpose of the project will be to investigate how natural language parsing techniques (using an existing parser) can be used to facilitate a) the extending of such a database with new facts; and b) the querying of the database.

Remarks: This is a collaboration between the NLIP group and Microsoft Research Cambridge (MSR). An interest in processing large datasets is a prerequisite.

## Unsupervised learning for joint segmentation and part-of-speech tagging

Proposer: Yue Zhang
Supervisor: Yue Zhang and Stephen Clark
Special Resources: None

#### Description

Joint word segmentation and POS-tagging is the problem of solving word segmentation and part-of-speech tagging simultaneously. By reducing error propagation and allowing POS information to help word segmentation, both segmentation and overall tagging could be improved. There has been some recent research on joint word segmentation and POS tagging, reporting competitive accuracies on the Chinese Treebank.

The current approach to joint segmentation and POS-tagging is mainly supervised. A set of manually annotated sentences from the CTB is used to train a system, and then another set of texts from the same treebank is used to test its accuracy. Supervised learning is expensive: it requires much human labour. On the other hand, abundant unannotated text is available. Unsupervised and semi-supervised approaches make use of these free resources to improve accuracy. They have been little explored for joint segmentation and POS-tagging so far.

Two existing approaches can be applied to joint segmentation and POS-tagging to achieve semi-supervised learning. First, unsupervised character clustering can be performed on a large set of raw text. The resulting character classes can be used as a feature in a supervised segmentor / tagger. Results have shown that this method improves segmentation accuracy.

Second, self-training can be performed. The idea is to use a trained system to segment and tag raw text. From the output, some statistical information can be collected. When processed properly, this information can be filtered / organized into features. Self-training has been shown to improve accuracies for a variety of NLP tasks.

References
Yue Zhang and Stephen Clark. Chinese Segmentation Using a Word-based Perceptron Algorithm. In proceedings of ACL 2007. Prague, Czech Republic. June.
Yue Zhang and Stephen Clark. Joint Word Segmentation and POS Tagging Using a Single Perceptron. In proceedings of ACL 2008. Ohio, USA. June.
Yue Zhang and Stephen Clark. A Fast Decoder for Joint Word Segmentation and POS-tagging Using a Single Discriminative Model. In proceedings of EMNLP 2010. Massachusetts, USA. October.

Requirements
Basic knowledge of C++. (A supervised system written in C++ will be provided.)

An intuitive example
Given the input sentence 上海浦东开发与法制建设同步, the existing system will produce the tagged output 上海_NR 浦东_NR 开发_NN 与_CC 法制_NN 建设_NN 同步_VV. The output is a segmented and part-of-speech tagged sentence, where each word is separated from its POS tag by '_', and NR, NN, CC and VV refer to proper noun, noun, conjunction and verb, respectively. The above output is correct. However, given the sentence 玩3D游戏有必要搞独立显卡么？, the output of the system is 玩3D_NR 游戏_NN 有_VE 必要_NN 搞_VV 独立_JJ 显卡么_NN ？_PU, where 玩3D and 显卡么 have segmentation and POS errors. The errors arise largely because the training data was newspaper text, while the test data was from computer forums. The semi-supervised system should improve the performance of the system in such unknown domains.
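To make the output format concrete, the following sketch reads a tagged line in the word_TAG format shown above and scores a prediction against a reference by joint F-score over (start, end, tag) character spans. This is a minimal sketch, not the provided C++ system or its evaluation script: the function names `parse_tagged`, `to_spans` and `joint_f1` are hypothetical, and the scoring scheme is an assumption about how joint accuracy might be measured.

```python
def parse_tagged(line):
    """Split 'word_TAG word_TAG ...' into (word, tag) pairs.

    rsplit keeps any underscore inside the word itself intact.
    """
    pairs = []
    for token in line.split():
        word, tag = token.rsplit("_", 1)
        pairs.append((word, tag))
    return pairs


def to_spans(pairs):
    """Convert (word, tag) pairs into a set of (start, end, tag) character spans."""
    spans, start = set(), 0
    for word, tag in pairs:
        end = start + len(word)
        spans.add((start, end, tag))
        start = end
    return spans


def joint_f1(predicted, gold):
    """F-score over spans: a word counts only if its boundaries and tag both match."""
    pred, ref = to_spans(predicted), to_spans(gold)
    correct = len(pred & ref)
    if correct == 0:
        return 0.0
    precision, recall = correct / len(pred), correct / len(ref)
    return 2 * precision * recall / (precision + recall)


output = parse_tagged("上海_NR 浦东_NR 开发_NN 与_CC 法制_NN 建设_NN 同步_VV")
sentence = "".join(word for word, _ in output)  # recovers the raw input sentence
```

Under this scheme a single segmentation error, such as 显卡么 being kept as one word, costs the predicted span as well as every reference span it covers.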

## Using distributional methods in parse ranking

Proposer: Ann Copestake
Supervisor: Ann Copestake

#### Description

Standard techniques for parse ranking have limited capability to determine the extent to which constructions are semantically plausible. This project will look at the use of distributional semantics to rerank the output from a broad-coverage grammar of English. Distributional semantic methods depend on the idea that words with similar meanings will appear in similar contexts. The notion of context can be simply a window of words in a large corpus but often parsed data is used. The intuition behind the use of distributional semantics in parse ranking is that it offers an additional source of information to that which is generally used (i.e., based on treebanks). While some experiments have been carried out on the use of distributional techniques for syntactic disambiguation, these have been done on isolated test sets (e.g., of noun-noun compounds) and it is not clear whether they extend to more complex constructions and whether they would offer a real advantage compared with existing parsing models.
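The window-based notion of context can be sketched in a few lines. The example below builds co-occurrence count vectors from tokenised sentences and compares words by cosine similarity. It is only an illustration on hypothetical toy data: the function names and the window size are assumptions, and a real experiment would use a large corpus, typically with parsed contexts and some weighting of the counts.

```python
import math
from collections import Counter, defaultdict


def context_vectors(sentences, window=2):
    """Count, for each word, the words occurring within `window` positions of it."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical toy corpus.
corpus = [
    ["cats", "chase", "mice"],
    ["dogs", "chase", "mice"],
    ["cats", "eat", "fish"],
]
vecs = context_vectors(corpus, window=1)
similarity = cosine(vecs["cats"], vecs["dogs"])  # both occur next to "chase"
```

For parse ranking, scores of this kind could serve as one feature of a candidate analysis, for instance the similarity between the two conjuncts of a proposed coordination.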

Preliminary experiments were carried out in last year's JHU CLSP Summer Workshop (see Chapter 6 of the JHU Report). However, the availability of large quantities of Treebanked data from DELPH-IN might make it preferable to experiment with the DELPH-IN English Resource Grammar (ERG). The easiest approach would probably be to rerank parses, but it should also be possible to experiment with adding distributionally-based features to the existing parse-ranking model.

#### Remarks

This is an open-ended project which could easily lead on to PhD research if desired. In order to obtain results during the ACS, it would be necessary to restrict the experiments to a limited range of constructions, possibly coordination (as with the JHU Workshop) or prepositional phrase attachment.

## Paraphrase using DELPH-IN technology and DMRS

Proposer: Ann Copestake
Supervisor: Ann Copestake

#### Description

The informal DELPH-IN collaboration produces technology for parsing and generation in a number of languages. Recently a new semantic representation has been developed, Dependency MRS (DMRS; Copestake 2009), which has a number of computational advantages. This project will look at semantic paraphrase using DMRS. We model paraphrase as a process in which a sentence is analysed by a parser to produce a semantic representation, which is then transformed into another semantic structure, which can in turn be used for generation. A very similar approach is used in the semantic transfer approach to machine translation. Paraphrase is currently possible in DELPH-IN, making use of the semantic transfer machinery between the earlier MRS representations, but the rules can become very complex, which makes them difficult to write manually or to acquire automatically. The first part of this project will be the development of a suitable DMRS transformation language. This should work for all languages for which we have a grammar capable of producing DMRSs. The expressivity of the language will be demonstrated using a paraphrase test set which can be constructed by expanding existing semantic test sets. Subsequent elements of the project are more open-ended: one possibility would be to look at implementing paraphrase on packed DMRS structures.

## Construction of DMRS representations with the Clark and Curran parser

Proposer: Ann Copestake
Supervisor: Ann Copestake

#### Description

This project concerns the semi-automatic construction of compositional semantic representations using the Dependency MRS (DMRS; Copestake 2009) language. DMRS representations can already be built for a variety of grammars going via the widely-used MRS representation. In particular, MRS is output from the DELPH-IN English Resource Grammar using hand-built lexical types and grammar rules. The aim of this project is to develop techniques to semi-automatically induce composition rules from the Clark and Curran parser (see Clark and Curran (2007)) in order to produce comparable representations. This has practical value: it would, for instance, allow the two parsers to be used in the same applications without altering other modules. It would also be of considerable theoretical interest. The proposed approach is to start with a simple test set of gold-standard DMRSs produced from the ERG and to parse the same sentences with the C&C parser in order to semi-automatically map between them.

## Using Lexical Resources to Improve Parsing Accuracy

Proposer: Anna Korhonen / Laura Rimell
Supervisor: Anna Korhonen / Laura Rimell

#### Description

To be effective, NLP must work at many levels, from entire texts down to sentences, phrases, and words. This project focuses on lexical information - that is, information about words - and how it can be used to improve accuracy on other tasks. For example, it may be helpful for a natural language parser or machine translation system to know about the behavior of specific verbs, such as the fact that "the group believes the management is interested" and "the group believes the management" are grammatical, but "the group believes is interested" is not. This is called subcategorization information and is an important area of lexical knowledge.

Lexical resources are particularly important for domain adaptation. A major challenge for parsing and other NLP tasks is the difference in vocabulary across disparate domains such as finance, sports, and science. For example, the word "induce" may occur more frequently, or with different usage, in biomedical research articles than in finance journalism. Lexical information can help a parser adapt to a new domain without having to learn an entirely new model of the grammar. Similarly, even for a parser which already incorporates lexical information, access to a large lexical resource may improve accuracy without additional changes to the parser model.

This project will test the use of a large subcategorization lexicon to improve the accuracy of a lexicalised parser by supplementing the parser's existing lexicon. The student will create a lexical resource using the system of Preiss, Briscoe, and Korhonen (2007), and test it with the C&C parser (Clark and Curran 2007). It will be possible to work on general text as well as adaptation to domain-specific text using specialized lexicons.

#### References

Stephen Clark and James R. Curran. 2007. Wide-Coverage Efficient Statistical Parsing with CCG and Log-Linear Models. Computational Linguistics, 33(4), pp.493-552.

Judita Preiss, Ted Briscoe and Anna Korhonen. 2007. A System for Large-scale Acquisition of Verbal, Nominal and Adjectival Subcategorization Frames from Corpora. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics. Prague, Czech Republic.

#### Remarks

The parser code is written in C++, and the code for creating lexical resources in Lisp and Perl. However, the project will probably involve modifying the parser's supplementary files rather than the parser code, so the main requirement is scripting knowledge. Background will be provided on CCG, the formalism underlying the parser, as well as on subcategorization frames.

## Citation-Block Determination

Proposer: Simone Teufel
Supervisor: Simone Teufel
Special Resources: None

#### Description

Automatically finding citations in scientific text is a precondition for citation-based search and for bibliometric metrics. The particular typographical conventions concerning citations in different domains make the determination of citations one of the easier tasks in NLP.

This project is trying to improve on purely typographic determination of citations, in terms of a notion of "citation block". A citation block is a contiguous area in a scientific paper, which semantically "belongs to" the citation, i.e., which describes content related to the citation. For the purpose of this project, we will define the borders of a citation block to coincide with sentence boundaries.

This project is to explore two different coherence-based approaches to finding citation blocks: coherence by lexical chains, and coherence by lexical repetition. These approaches have been successful in other tasks (Hearst, 1997; Barzilay and Elhadad, 1997). Alternative methods use anaphora resolution (Kim and Webber 2006) and coreference (Kaplan et al. 2009). The project involves implementing at least two of the mentioned approaches, running them on an existing, citation-parsed corpus of around 16,000 scientific texts in one area, and using an evaluation method of choice to determine how well the algorithms perform relative to each other, and to a baseline. Evaluation possibilities are a) a gold-standard evaluation, which means that the student performs some annotation, or b) a human evaluation study, where human subjects are asked if they agree with the system's boundaries.
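A very simple baseline for the lexical-repetition approach might look like the sketch below: starting from the citing sentence, the block is extended to a neighbouring sentence as long as adjacent sentences share enough content words. This is only a sketch under stated assumptions; the stopword list, the Jaccard overlap measure, the threshold and the function names are all hypothetical choices, not a method taken from the cited papers.

```python
# Assumed, deliberately tiny stopword list for illustration only.
STOPWORDS = frozenset(
    {"the", "a", "an", "of", "in", "to", "and", "is", "we", "this", "that", "it", "for", "on"}
)


def content_words(sentence):
    """Lower-cased tokens with surrounding punctuation stripped and stopwords removed."""
    tokens = {w.lower().strip(".,;:()") for w in sentence.split()}
    return tokens - STOPWORDS - {""}


def overlap(a, b):
    """Jaccard overlap between two word sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def citation_block(sentences, cite_idx, threshold=0.2):
    """Return (start, end) sentence indices of the block around the citing sentence."""
    words = [content_words(s) for s in sentences]
    start = end = cite_idx
    while start > 0 and overlap(words[start - 1], words[start]) >= threshold:
        start -= 1
    while end < len(sentences) - 1 and overlap(words[end + 1], words[end]) >= threshold:
        end += 1
    return start, end


# Hypothetical toy paragraph; the citation is in sentence 1.
paragraph = [
    "Parsing is hard.",
    "Smith (2001) proposed a chart parser.",
    "The chart parser uses dynamic programming.",
    "Our corpus contains news text.",
]
block = citation_block(paragraph, cite_idx=1)
```

Because the block borders fall on sentence boundaries, the returned index pair can be compared directly with annotated gold-standard blocks.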

This project would suit a student who is interested in algorithms (e.g., the lexical chain algorithm), and who likes data work (e.g., looking through dozens of citation blocks, deciding where they start and end). The student should have good intuition about writing style in science, and be able to generalise over similarities in writing style. Programming language of choice.

#### References:

Barzilay, R. and Elhadad, M. Using Lexical Chains for Text Summarization. 1997 Summarization Workshop.

Hearst, M. Text Tiling. Computational Linguistics. 1997.

Kaplan, D., Iida, R. and Tokunaga, T. Automatic extraction of citation contexts for research paper summarization: a coreference-chain based approach. 2009. ACL Workshop on NLP and IR for Digital Libraries (NLPIR4DL09).

Kim, Y. and Webber, B. Automatic Reference resolution in astronomy articles. Proc. of 20th International CODATA Conference 2006.

## A Citation-Based Summariser

Proposer: Simone Teufel
Supervisor: Simone Teufel
Special Resources: None

#### Description

This project concerns how to automatically construct a summary, given several citation sentences talking about the same paper. Challenges concern: 1. how to select the right raw material 2. how to linguistically process the selected raw material.

This project is rather open-ended, as the question of how to summarise scientific text is a very large one. However, citation-based summarisation can rely on a previous classification of sentiment towards the citation (provided). Sentences with the same sentiment should then be clustered, using an out-of-the-box clustering algorithm (e.g., cludo). Realisation of the final summary could then be very simple (just displaying verbatim sentences), or more sophisticated (e.g., reduction of the sentence based on a parse). The resulting summary would then be evaluated by human judgement. The student would apply experimental methodology in setting up the judgement-based evaluation.

As this project could concentrate on various of the substeps, the exact scope of this project can be negotiated, dependent on the student's particular interests. An ideal student for this project would therefore be rather independent and able to digest a rather large literature. Programming language of choice.

#### References:

DUC summarisation conference webpage. Look for paper called "HedgeTrimmer" (Dorr et al. 2003) for one possible sentence shortening approach.

Barzilay, R and McKeown, K. ACL 1999. Information Fusion in the context of multi-document summarization.

Teufel, 2001. Task-based evaluation of summary quality: Describing relationships between scientific papers. NAACL-Workshop on Summarisation.

## Determination of rhetorically charged sentences in scientific writing

Proposer: Simone Teufel
Supervisor: Simone Teufel
Special Resources: None

#### Description

Detecting innovation in science across an entire field, by analysing the scientific literature automatically, is currently a hot research area. While most approaches rely on a statistical analysis of the words contained in the papers, this project uses parsing and machine learning to detect rhetorically charged sentences, and therefore uncovers innovations which are explicitly declared in the paper.

In particular, this project concentrates on the detection of:

• Statements of innovation ("to our knowledge, we are the first to...")
• Naming statements of own artefacts ("a process we name MILRED")
• Naming statements of others' citations ("Miller and Berger, 1998, henceforth called M&B").

(Possibly a subset of these).

Starting from a set of annotated sentences of these kinds, and known indicator phrases for these sentences, parsing and WordNet are to be used to detect similar statements in unseen text. Evaluation might rely on a gold standard (created by the student themselves), or on human evaluation.
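As a point of comparison for the parsing and WordNet approach, a surface-pattern baseline built from indicator phrases could be sketched as follows. The patterns are hypothetical: they are derived only from the three example statements listed above, not from the project's actual indicator-phrase list, and the label names are made up for illustration.

```python
import re

# Assumed patterns, one per statement type, modelled on the examples above.
PATTERNS = {
    "innovation": re.compile(r"\b(to our knowledge|we are the first to)\b", re.I),
    "own_naming": re.compile(r"\b(we|which we) (name|call|term)\b", re.I),
    "citation_naming": re.compile(r"\bhenceforth( called| referred to as)?\b", re.I),
}


def classify(sentence):
    """Return the sorted labels of all patterns that match the sentence."""
    return sorted(label for label, pattern in PATTERNS.items() if pattern.search(sentence))


labels = classify("To our knowledge, we are the first to model this.")
```

Such patterns miss paraphrases ("no previous work has..."), which is where parse-based generalisation and WordNet synonyms would be expected to help.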

The student choosing this project should have an interest in semantics and should not shy away from data analysis. Programming language of choice.

#### References:

Lisacek, F. Chichester, C., Kaplan, A, Sandor, A. Discovering paradigm shift sentences in biomedical abstracts. International Symposium on Semantic Mining in Biomedicine (SMBM). 2005.

## Topic models of social network data

Contact: Diarmuid Ó Séaghdha, Daniele Quercia, Stephen Clark

The family of hierarchical probabilistic models known as "topic models" has become a standard tool for a variety of tasks in document and language processing. While the best-known topic models such as Latent Dirichlet Allocation (LDA; Blei et al., 2003) view a collection of documents as being without structure, subsequent developments have built on LDA to incorporate knowledge about structural properties of the data. One such development has been the proposal of models that apply when the documents correspond to nodes in a graph or network (e.g., Daumé III, 2009; Chang and Blei, 2010). Such models capture the intuition that proximity in the network is likely to correlate with proximity in semantic content.

Social networks such as Facebook and Twitter can be viewed through a topic modelling lens by making the analogy that a user in the network, or the content produced by the user, is similar to a document in a more traditional collection. The goal of the project proposed here is to produce a joint model of the content and structure of a social network, using existing methods for graph-aware topic modelling.

A number of datasets are available, each lending itself to a different application:

• London-based Twitter users: We have a dataset that contains the profiles, tweets, and ego-networks of 258,384 Twitter users in London. The goal is to learn in a minimally-supervised way how to identify unknown user properties such as location. Many Twitter users give only a vague location (e.g., "London"), but in many cases information from tweets and follows/following connections can indicate a more precise one. One approach to this would use insights from "Labeled LDA" (Ramage et al., 2009).
• Last.fm profiles: We have a dataset that contains: (a) which music artists 360,000 Last.fm users have listened to; and (b) which individual songs 1,000 Last.fm users have listened to. The goal is to improve the quality of music recommendations by identifying probable links between distant users. An understudied aspect of recommendation quality is diversity. If I have listened to 50 songs by Frank Zappa, I do not gain much information from a recommendation of 10 other Frank Zappa songs; I would rather receive 10 diverse recommendations that I may not have heard of before. There is obviously an interesting trade-off between diversity and precision.
• Psychological profiles of Facebook users: We have a dataset that contains the results of psychometric tests taken by more than 2,000,000 Facebook users. As well as the data from the tests, around 40% of the respondents agreed to give us access to their Facebook profile data and social network data. The goal would be to integrate test results in a model that also captures the textual content and friendship links for those users.
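The diversity/precision trade-off mentioned for the Last.fm data can be illustrated with a greedy re-ranking step, loosely in the spirit of maximal marginal relevance. This is a swapped-in illustration rather than part of the proposed topic model: the relevance scores, the per-artist penalty, the `trade_off` parameter and the function name `diversify` are all assumptions.

```python
def diversify(candidates, k, trade_off=0.5):
    """Greedily pick k recommendations from (song, artist, relevance) triples.

    Each pick maximises trade_off * relevance minus (1 - trade_off) times the
    number of songs already chosen from the same artist.
    """
    selected, artist_counts = [], {}
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def gain(candidate):
            _, artist, relevance = candidate
            return trade_off * relevance - (1 - trade_off) * artist_counts.get(artist, 0)

        best = max(remaining, key=gain)
        remaining.remove(best)
        selected.append(best)
        artist_counts[best[1]] = artist_counts.get(best[1], 0) + 1
    return selected


# Hypothetical relevance scores, e.g. produced by a recommender model.
scored = [
    ("song1", "Frank Zappa", 0.9),
    ("song2", "Frank Zappa", 0.8),
    ("song3", "Captain Beefheart", 0.6),
]
diverse = [song for song, _, _ in diversify(scored, k=2, trade_off=0.5)]
precise = [song for song, _, _ in diversify(scored, k=2, trade_off=1.0)]
```

Setting `trade_off` to 1.0 recovers plain ranking by relevance, while lower values trade some precision for a wider spread of artists.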

This project is suitable for a student with some knowledge of machine learning and an interest in social network analysis and/or natural language processing.

References: