I will preface this discussion of themes with some general remarks. I have started to use the term `language modelling' to describe what my research is about. This term is generally used narrowly to refer to assigning a probability to a sequence of words (for instance as part of a speech recogniser), but I use it much more generally, to refer to producing a formal computational description of (some aspect of) human language. This is a scientific endeavour: attempting to understand how language works by producing simulations which can be tested. As such, it's analogous to modelling weather or the movement of fluids or chemical reactions. It is unlike normal sciences primarily in that a) to an extent, the data can be observed and elicited without any special apparatus or technology, and b) it interconnects with really deep and murky issues of human thought and intelligence.
The connections between this view of language modelling and linguistics are complex. Field linguists gather data we might want to model. Formal syntax and semantics can be seen as being about language modelling, in the sense I'm using it, and much of my research uses concepts developed in linguistics, but most linguists refuse to consider the use of probabilities in their models. I discuss this a bit further below. There are also fundamental issues about the abstractions of the data that linguists generally adopt, which are too complex to discuss adequately here.
There are many types of language model. We may be interested in
modelling the probability of word sequences, or in modelling the
interpretations that are possible for word sequences (in some
context), or in modelling the words that are generated given that
someone wants to convey a meaning. I'm primarily interested
in the latter two.
In computational linguistics, these assumptions (discussed further
under `Combining generative and data-driven approaches to language' below)
have been translated into an architecture for analysis
in which a syntax-driven parser is used to
derive structure that allows the meanings of individual words to
be combined. The meaning representation produced in this way can then be
passed to an AI engine capable of inference.
That engine is assumed to have axioms concerning
symbols such as vote'.
The classic natural language processing systems of the
1970s, such as LUNAR and SHRDLU, assumed this model in some form or another.
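As a purely illustrative sketch of that architecture (my own toy example, not code from LUNAR, SHRDLU or any other system mentioned here): a `parser' maps a fixed sentence pattern onto a predicate-argument structure whose pieces are bare lexical symbols, and an `inference engine' applies a hand-coded axiom to the symbol vote'. Real systems of this kind had vastly richer grammars and knowledge bases.

    # Toy pipeline: syntax-driven analysis -> compositional meaning
    # representation -> inference over hand-coded axioms.  All predicates
    # and rules are invented for the example.

    def analyse(sentence):
        # "Parse" a fixed pattern: proper-noun verb "for" proper-noun,
        # e.g. "Kim voted for Lee."
        words = sentence.lower().rstrip(".").split()
        subj, verb, _, obj = words
        lexicon = {"voted": "vote'"}        # lexical meaning is just a symbol
        return (lexicon[verb], subj, obj)   # e.g. ("vote'", "kim", "lee")

    # "AI engine": axioms relating vote' to other symbols.
    AXIOMS = {"vote'": ["participate'"]}    # vote'(x, y) -> participate'(x, y)

    def infer(fact):
        pred, *args = fact
        derived = {fact}
        for consequent in AXIOMS.get(pred, []):
            derived.add((consequent, *args))
        return derived

    print(infer(analyse("Kim voted for Lee.")))
    # {("vote'", 'kim', 'lee'), ("participate'", 'kim', 'lee')}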
At the level of
pragmatics,
AI and linguistic philosophy really came together in
Perrault and Allen's (1980) plan-based
account of Searle's speech act theory (e.g., Searle, 1975).
The deficiencies of this model as a foundation for computational
work are now at least implicitly recognised by most
researchers in computational linguistics.
One major source of problems is the lack of availability
of world knowledge and full AI-style inference. The classic NLP
systems turned out to be toys that could not be scaled up from tiny
domains.
Some CL researchers, however, still believe in the
classic model for the linguistic aspects of processing and
believe that the role
of statistical techniques, as used for parse
selection for instance, is as an approximation to the use of
world knowledge. That is, they hold that statistical techniques work
primarily because they make up for the failure of classical AI
in disambiguation and so on.
I think this is not the case and that a major part of the
effectiveness of data-driven and statistical techniques is in
compensating for the failures of the classic linguistic model to account
for some aspects of language processing,
in particular with respect to disambiguation
(and instantiation of underspecified
or vague terms), multiword expressions
and lexical semantics.
I think that language is best described as
semi-generative, involving a complex interaction between productivity
and conventionality. Data-driven techniques can
model classes of conventionality in language that are not
generally accounted for or even appreciated in formal linguistics. In
fact, I suspect that the majority of computational linguists now
believe that formal linguistics research is almost completely
irrelevant to their concerns. I hope that
formal linguistics is not in fact irrelevant, and that models which combine a classic
generative approach with a statistical approach that properly allows
for lexical phenomena will turn out to be the most useful for computational
work in the long term.
What I am interested in doing,
on a theoretical level, is discovering more about how conventionality
and generativity interact. On a more practical level, I am interested
in building large-scale systems that combine
work motivated by formal linguistics with data-driven approaches.
My most recent work that directly addresses the theoretical issues is not yet
published.
Some notes about terminology: my use of `generative' may be confusing,
but I can't think of a more appropriate term, so I hope my use is clear
from context.
Some people talk about `empirical' methods
in CL in contrast to `theoretical' or `linguistic' approaches.
But large-scale, hand-built grammars of the sort that are available via
DELPH-IN (see below)
are necessarily constructed based on real data - this is just as
`empirical' as a grammar built automatically on the basis of
a hand-annotated corpus.
Annotators and grammar-writers alike are working on the basis of some
theoretical notion of syntax: DELPH-IN grammar writers are using a
more complex theory than underlies the
Penn Treebank, but that is not a fundamental difference.
By data-driven, I mean approaches which do not presuppose a
particular classification.
Examples are ngram language models,
word sense induction via clustering, vector space models
for lexical semantics and some approaches to collocation.
But in many cases, a data-driven statistical model
gives a distribution over a non-data-driven (predefined) classification.
Of course a statistical
model for disambiguation can be constructed
on the basis of a hand-built grammar: the Redwoods work demonstrates this for
DELPH-IN.
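To make the first of these concrete, here is a minimal bigram language model (the corpus and add-one smoothing are chosen purely for illustration): it assigns probabilities to word sequences directly from raw text, without presupposing any linguistic classification.

    # Minimal bigram language model estimated from a toy corpus.
    from collections import Counter

    corpus = "the dog barked . the dog slept . the cat slept .".split()

    bigrams = Counter(zip(corpus, corpus[1:]))
    unigrams = Counter(corpus)
    vocab = len(unigrams)

    def p(word, prev):
        """P(word | prev) with add-one smoothing."""
        return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab)

    print(p("dog", "the"))     # 0.33...: 'the dog' occurs twice in the corpus
    print(p("barked", "cat"))  # 0.14...: unseen bigram, smoothed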
Some musings about Open Source in CL.
Because what is useful depends so much on factors external to the
research itself (see `Applications' below), it makes sense to concentrate on producing core
technology such as analysers and generators. The expectation is that these
will find many uses, so working on improving speed and robustness,
for instance, will ultimately turn out to be useful. It is
always important to test such technology on real data, preferably data that
is representative of some application, but the central idea
behind our research on DELPH-IN is that we can make progress by
building language processing modules that are (mostly) application independent.
In fact, systems very similar to NLIDs (natural language interfaces to
databases, discussed under `Applications' below) are now making a reappearance in
limited-domain question answering and email response where there is
no GUI alternative.
I am trying to make sure my research work is relevant
while avoiding the applications trap by working on a family
of applications connected with scientific text. This also
fits in with other work in the NLIP group. The idea behind
SciBorg is to try to
develop NLP tools and architecture with the goal of helping people
who need to access Chemistry texts. The project involves
text mining and information extraction, but we are considering
this very broadly: for instance, we are aiming to extract information
about the relationship between papers (via Simone Teufel's AZ techniques).
Combining generative and data-driven approaches to language
Formal linguistics has emphasised the idea of language as being essentially
generative: words can be combined into phrases, phrases into
utterances and so on, and meaning is built up compositionally.
Lexical meaning is minimally represented, typically
as a symbol corresponding to the lexeme which is to be understood
as a pointer to a set of entities (e.g., the meaning of the word
vote is represented as vote' with no explicit connection
to concepts such as election' or poll').
The compositionally-derived meaning of an
utterance may be highly ambiguous or underspecified.
A hearer is expected to use reasoning on the basis of world
knowledge and assumptions about the context and the speaker to fully
understand an utterance.
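As an illustration of the last point (in invented notation, not MRS or any other specific formalism): composition can deliver a single representation that leaves quantifier scope open, and which therefore stands for more than one fully scoped reading.

    # Schematic underspecification: "every dog chased some cat" composed
    # into one representation whose quantifier scope is left unresolved.
    from itertools import permutations

    quantifiers = [("every", "x", "dog'(x)"), ("some", "y", "cat'(y)")]
    body = "chase'(x, y)"

    def scoped_readings(quants, body):
        """Enumerate the fully scoped logical forms the underspecified form allows."""
        for order in permutations(quants):
            reading = body
            for q, var, restriction in reversed(order):
                reading = f"{q}({var}, {restriction}, {reading})"
            yield reading

    for r in scoped_readings(quantifiers, body):
        print(r)
    # every(x, dog'(x), some(y, cat'(y), chase'(x, y)))
    # some(y, cat'(y), every(x, dog'(x), chase'(x, y)))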
The project
`Integrating pragmatic insights with HPSG'
is also related.
Tools and systems for computational linguistics
All my main work on building tools and systems is now distributed via
DELPH-IN.
Most of this concerns tools for generative approaches (in the sense discussed
above) although with suitable hooks for data-driven work (where someone
has figured out a good way of making the connection).
The LKB system
is well documented elsewhere. I am still very actively
involved in developing it. Current work is mostly
on semantics. The tools for processing
MRS and RMRS
are essentially independent of the LKB, although compatible with it.
I am committed to making everything available as Open Source
and to ensuring that it scales reasonably. Personally I think that one of the
DELPH-IN slogans should be `No more toy systems!'
since I think more damage has been done to CL by people claiming results
on the basis of tiny, flaky systems than by anything else. Not only does it
give outsiders
an unrealistic idea of the field, it also blocks others from getting research
grants to do something properly. Obviously people have to be able
to build research prototypes and to try things out before
moving to large-scale, more efficient implementations.
But there is no excuse for having a series of projects and
delivering throw-away `results'.
Representation issues in computational linguistics
Computational compositional semantics
The invited talk I gave at
EACL 2009 is the manifesto for
Slacker semantics (I am sure I'm going to regret using that name).
The link above points to the slides.
There is a paper in the proceedings:
`Slacker Semantics:
Why Superficiality, Dependency and Avoidance of Commitment can be the
Right Way to Go', but the slides are directed at a more general
audience and probably give a better idea of the background. In the
context of what I'm discussing here, the point is that we want a broad
coverage semantic representation which is a good model of what we get
from syntax and inflectional morphology, so it's important that we
don't overcommit. This is the area that I'm personally working on
most actively at the moment.
Language generation
I've mostly worked on what used to be called tactical generation,
which is the process of going from a logical form to a string. The term
realization is now more generally used, but often the
assumption is that the input to realization is closer to the syntax
than the logical representations that I work with. The LKB system
allows generation as well as parsing with the same grammars. I also
did some work with Advaith Siddharthan on generation of referring
expressions. With Guido Minnen and Francis Bond, I worked on
determiner choice and that work was followed up by two MPhil projects
which I supervised (by Gabriel Ezekiel and Lin Mei).
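To illustrate what `going from a logical form to a string' involves at its very simplest (lexicon, template and logical-form format are all invented here; the LKB's chart-based generator is of course far more general):

    # Toy realizer: map a two-place predication onto a transitive clause.
    LEXICON = {"vote'": "voted for", "kim": "Kim", "lee": "Lee"}

    def realise(pred, subj, obj):
        """Realise pred(subj, obj) as a simple transitive clause."""
        return f"{LEXICON[subj]} {LEXICON[pred]} {LEXICON[obj]}."

    print(realise("vote'", "kim", "lee"))   # Kim voted for Lee.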
These slides illustrate how I
think this might all fit together. I think that the ability to use
the ERG with the PET and the LKB to generate and paraphrase could be
exploited in some interesting projects, although I haven't managed to
work out a compelling practical application yet. Incremental
generation is an area which has been neglected.
Lexical semantics
Lexical semantics (including multiword expressions): to be written.
Applications
I do not centre my long-term research goals around particular
applications. This is not because I think building working systems is
uninteresting - in fact it can be a real validation of
the research. But what is useful and what isn't is hugely
dependent on factors other than the computational linguistic
research itself. Advances in computer speed and (more
importantly) storage, the Web, the move to electronic publication of
newspapers and books, the advent of GUIs and the ever-changing
obsessions of the main funders have had far more influence on NLP
research than anything that has come from within the field. To give
one example: natural language interfaces to databases (NLIDs) were one
of the main applications considered by NLP researchers from the late
1970s to the mid-1980s. Many usable systems were built and there were
several reasonably successful companies distributing such technology.
But the market ceased to exist, at least partly because of graphical
user interfaces which gave users an alternative to formal languages
such as SQL. Most of the research directed specifically at NLIDs
dried up.