I will preface this discussion of themes with some general remarks. I have started to use the term `language modelling' to describe what my research is about. This term is generally used narrowly to refer to assigning a probability to a sequence of words (for instance as part of a speech recogniser), but I use it much more generally, to refer to producing a formal computational description of (some aspect of) human language. This is a scientific endeavour: attempting to understand how language works by producing simulations which can be tested. As such, it's analogous to modelling weather or the movement of fluids or chemical reactions. It is unlike normal sciences primarily in that a) to an extent, the data can be observed and elicited without any special apparatus or technology, and b) it interconnects with really deep and murky issues of human thought and intelligence.
The connections between this view of language modelling and linguistics are complex. Field linguists gather data we might want to model. Formal syntax and semantics can be seen as being about language modelling, in the sense I'm using it, and much of my research uses concepts developed in linguistics, but most linguists refuse to consider the use of probabilities in their models. I discuss this a bit further below. There are also fundamental issues about the abstractions of the data that linguists generally adopt, which are too complex to discuss adequately here.
There are many types of language model. We may be interested in
modelling the probability of word sequences, or in modelling the
interpretations that are possible for word sequences (in some
context), or in modelling the words that are generated given that
someone wants to convey a meaning. I'm primarily interested
in the latter two.
In computational linguistics, these assumptions (discussed further
under `Combining generative and data-driven approaches to language' below)
have been translated into an architecture for analysis
in which a syntax-driven parser is used to
derive structure that allows the meanings of individual words to
be combined. The meaning representation produced in this way can then be
passed to an AI engine capable of inference.
That engine is assumed to have axioms concerning
symbols such as vote'.
The classic natural language processing systems of the
1970s, such as LUNAR and SHRDLU, assumed this model in some form or another.
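As a purely illustrative sketch of that architecture (my own toy example, not code from LUNAR, SHRDLU or any other system mentioned here): a `parser' maps a fixed sentence pattern onto a predicate-argument structure whose pieces are bare lexical symbols, and an `inference engine' applies a hand-coded axiom to the symbol vote'. Real systems of this kind had vastly richer grammars and knowledge bases.

    # Toy pipeline: syntax-driven analysis -> compositional meaning
    # representation -> inference over hand-coded axioms.  All predicates
    # and rules are invented for the example.

    def analyse(sentence):
        # "Parse" a fixed pattern: proper-noun verb "for" proper-noun,
        # e.g. "Kim voted for Lee."
        words = sentence.lower().rstrip(".").split()
        subj, verb, _, obj = words
        lexicon = {"voted": "vote'"}        # lexical meaning is just a symbol
        return (lexicon[verb], subj, obj)   # e.g. ("vote'", "kim", "lee")

    # "AI engine": axioms relating vote' to other symbols.
    AXIOMS = {"vote'": ["participate'"]}    # vote'(x, y) -> participate'(x, y)

    def infer(fact):
        pred, *args = fact
        derived = {fact}
        for consequent in AXIOMS.get(pred, []):
            derived.add((consequent, *args))
        return derived

    print(infer(analyse("Kim voted for Lee.")))
    # {("vote'", 'kim', 'lee'), ("participate'", 'kim', 'lee')}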
At the level of
pragmatics,
AI and linguistic philosophy really came together in
Perrault and Allen's (1980) plan-based
account of Searle's speech act theory (e.g., Searle, 1975).
The deficiencies of this model as a foundation for computational
work are now at least implicitly recognised by most
researchers in computational linguistics.
One major source of problems is the lack of availability
of world knowledge and full AI-style inference. The classic NLP
systems turned out to be toys that could not be scaled up from tiny
domains.
Some CL researchers, however, still believe in the
classic model for the linguistic aspects of processing and
believe that the role
of statistical techniques, as used for parse
selection for instance, is as an approximation to the use of
world knowledge. That is, they hold that statistical techniques work
primarily because they make up for the failure of classical AI
in disambiguation and so on.
I think this is not the case and that a major part of the
effectiveness of data-driven and statistical techniques is in
compensating for the failures of the classic linguistic model to account
for some aspects of language processing,
in particular with respect to disambiguation
(and instantiation of underspecified
or vague terms), multiword expressions
and lexical semantics.
I think that language is best described as
semi-generative, involving a complex interaction between productivity
and conventionality. Data-driven techniques can
model classes of conventionality in language that are not
generally accounted for or even appreciated in formal linguistics. In
fact, I suspect that the majority of computational linguists now
believe that formal linguistics research is almost completely
irrelevant to their concerns. I hope that
formal linguistics is not in fact irrelevant, and that models which combine a classic
generative approach with a statistical approach that properly allows
for lexical phenomena will turn out to be the most useful for computational
work in the long term.
What I am interested in doing,
on a theoretical level, is discovering more about how conventionality
and generativity interact. On a more practical level, I am interested
in building large-scale systems that combine
work motivated by formal linguistics with data-driven approaches.
My most recent work that directly addresses the theoretical issues is not yet
published.
Some notes about terminology: my use of `generative' may be confusing,
but I can't think of a more appropriate term, so I hope my use is clear
from context.
Some people talk about `empirical' methods
in CL in contrast to `theoretical' or `linguistic' approaches.
But large-scale, hand-built grammars of the sort that are available via
DELPH-IN (see below)
are necessarily constructed based on real data - this is just as
`empirical' as a grammar built automatically on the basis of
a hand-annotated corpus.
Annotators and grammar-writers alike are working on the basis of some
theoretical notion of syntax: DELPH-IN grammar writers are using a
more complex theory than underlies the
Penn Treebank, but that is not a fundamental difference.
By data-driven, I mean approaches which do not presuppose a
particular classification.
Examples are ngram language models,
word sense induction via clustering, vector space models
for lexical semantics and some approaches to collocation.
But in many cases, a data-driven statistical model
gives a distribution over a non-data-driven (predefined) classification.
Of course a statistical
model for disambiguation can be constructed
on the basis of a hand-built grammar: the Redwoods work demonstrates this for
DELPH-IN.
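To make the first of these concrete, here is a minimal bigram language model (the corpus and add-one smoothing are chosen purely for illustration): it assigns probabilities to word sequences directly from raw text, without presupposing any linguistic classification.

    # Minimal bigram language model estimated from a toy corpus.
    from collections import Counter

    corpus = "the dog barked . the dog slept . the cat slept .".split()

    bigrams = Counter(zip(corpus, corpus[1:]))
    unigrams = Counter(corpus)
    vocab = len(unigrams)

    def p(word, prev):
        """P(word | prev) with add-one smoothing."""
        return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab)

    print(p("dog", "the"))     # 0.33...: 'the dog' occurs twice in the corpus
    print(p("barked", "cat"))  # 0.14...: unseen bigram, smoothed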
Some musings about Open Source in CL.
Because what is useful depends so much on factors external to the
research itself (see `Applications' below), it makes sense to concentrate on producing core
technology such as analysers and generators. The expectation is that these
will find many uses, so working on improving speed and robustness,
for instance, will ultimately turn out to be useful. It is
always important to test such technology on real data, preferably data that
is representative of some application, but the central idea
behind our research on DELPH-IN is that we can make progress by
building language processing modules that are (mostly) application independent.
In fact, systems very similar to NLIDs (natural language interfaces to
databases, discussed under `Applications' below) are now making a reappearance in
limited-domain question answering and email response where there is
no GUI alternative.
I am trying to make sure my research work is relevant
while avoiding the applications trap by working on a family
of applications connected with scientific text. This also
fits in with other work in the NLIP group. The idea behind
SciBorg is to try to
develop NLP tools and architecture with the goal of helping people
who need to access Chemistry texts. The project involves
text mining and information extraction, but we are considering
this very broadly: for instance, we are aiming to extract information
about the relationship between papers (via Simone Teufel's AZ techniques).
Combining generative and data-driven approaches to language
Formal linguistics has emphasised the idea of language as being essentially
generative: words can be combined into phrases, phrases into
utterances and so on, and meaning is built up compositionally.
Lexical meaning is minimally represented, typically
as a symbol corresponding to the lexeme which is to be understood
as a pointer to a set of entities (e.g., the meaning of the word
vote is represented as vote' with no explicit connection
to concepts such as election' or poll').
The compositionally-derived meaning of an
utterance may be highly ambiguous or underspecified.
A hearer is expected to use reasoning on the basis of world
knowledge and assumptions about the context and the speaker to fully
understand an utterance.
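As an illustration of the last point (in invented notation, not MRS or any other specific formalism): composition can deliver a single representation that leaves quantifier scope open, and which therefore stands for more than one fully scoped reading.

    # Schematic underspecification: "every dog chased some cat" composed
    # into one representation whose quantifier scope is left unresolved.
    from itertools import permutations

    quantifiers = [("every", "x", "dog'(x)"), ("some", "y", "cat'(y)")]
    body = "chase'(x, y)"

    def scoped_readings(quants, body):
        """Enumerate the fully scoped logical forms the underspecified form allows."""
        for order in permutations(quants):
            reading = body
            for q, var, restriction in reversed(order):
                reading = f"{q}({var}, {restriction}, {reading})"
            yield reading

    for r in scoped_readings(quantifiers, body):
        print(r)
    # every(x, dog'(x), some(y, cat'(y), chase'(x, y)))
    # some(y, cat'(y), every(x, dog'(x), chase'(x, y)))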
The project
`Integrating pragmatic insights with HPSG'
is also related.
Tools and systems for computational linguistics
All my main work on building tools and systems is now distributed via
DELPH-IN.
Most of this concerns tools for generative approaches (in the sense discussed
above) although with suitable hooks for data-driven work (where someone
has figured out a good way of making the connection).
The LKB system
is well documented elsewhere. I am still very actively
involved in developing it. Current work is mostly
on semantics. The tools for processing
MRS and RMRS
are essentially independent of the LKB, although compatible with it.
I am committed to making everything available as Open Source
and to ensuring that it scales reasonably. Personally I think that one of the
DELPH-IN slogans should be `No more toy systems!'
since I think more damage has been done to CL by people claiming results
on the basis of tiny, flaky systems than by anything else. Not only does it
give outsiders
an unrealistic idea of the field, it also blocks others from getting research
grants to do something properly. Obviously people have to be able
to build research prototypes and to try things out before
moving to large-scale, more efficient implementations.
But there is no excuse for having a series of projects and
delivering throw-away `results'.
Representation issues in computational linguistics
Computational compositional semantics
The invited talk I gave at
EACL 2009 is the manifesto for
Slacker semantics (I am sure I'm going to regret using that name).
The link above points to the slides.
There is a paper in the proceedings:
`Slacker Semantics:
Why Superficiality, Dependency and Avoidance of Commitment can be the
Right Way to Go', but the slides are directed at a more general
audience and probably give a better idea of the background. In the
context of what I'm discussing here, the point is that we want a broad
coverage semantic representation which is a good model of what we get
from syntax and inflectional morphology, so it's important that we
don't overcommit. This is the area that I'm personally working on
most actively at the moment.
Language generation
I've mostly worked on what used to be called tactical generation,
which is the process of going from a logical form to a string. The term
realization is now more generally used, but often the
assumption is that the input to realization is closer to the syntax
than the logical representations that I work with. The LKB system
allows generation as well as parsing with the same grammars. I also
did some work with Advaith Siddharthan on generation of referring
expressions. With Guido Minnen and Francis Bond, I worked on
determiner choice and that work was followed up by two MPhil projects
which I supervised (by Gabriel Ezekiel and Lin Mei).
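To illustrate what `going from a logical form to a string' involves at its very simplest (lexicon, template and logical-form format are all invented here; the LKB's chart-based generator is of course far more general):

    # Toy realizer: map a two-place predication onto a transitive clause.
    LEXICON = {"vote'": "voted for", "kim": "Kim", "lee": "Lee"}

    def realise(pred, subj, obj):
        """Realise pred(subj, obj) as a simple transitive clause."""
        return f"{LEXICON[subj]} {LEXICON[pred]} {LEXICON[obj]}."

    print(realise("vote'", "kim", "lee"))   # Kim voted for Lee.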
These slides illustrate how I
think this might all fit together. I think that the ability to use
the ERG with the PET and the LKB to generate and paraphrase could be
exploited in some interesting projects, although I haven't managed to
work out a compelling practical application yet. Incremental
generation is an area which has been neglected.
Lexical semantics
Lexical semantics (including multiword expressions): to be written.
Applications
I do not centre my long-term research goals around particular
applications. This is not because I think building working systems is
uninteresting - in fact it can be a real validation of
the research. But what is useful and what isn't is hugely
dependent on factors other than the computational linguistic
research itself. Advances in computer speed and (more
importantly) storage, the Web, the move to electronic publication of
newspapers and books, the advent of GUIs and the ever-changing
obsessions of the main funders have had far more influence on NLP
research than anything that has come from within the field. To give
one example: natural language interfaces to databases (NLIDs) were one
of the main applications considered by NLP researchers from the late
1970s to the mid-1980s. Many usable systems were built and there were
several reasonably successful companies distributing such technology.
But the market ceased to exist, at least partly because of graphical
user interfaces which gave users an alternative to formal languages
such as SQL. Most of the research directed specifically at NLIDs
dried up.