Overview
SciBorg is a four-year project, starting October 1, 2005, funded by the EPSRC
under the programme for Computer Science for e-Science.
The project is a collaboration between three groups at the University
of Cambridge:
We are cooperating with three major publishers:
The project summary and objectives are below. For further information, please
see the detailed project description, which was based on the project proposal,
and the developing SciBorg project wiki pages.
Summary
Many tools exist for processing natural languages, such as English, but there
is no single perfect system. Different approaches have different strengths and
weaknesses. For instance, some very fast processors are designed to make
decisions about part of speech: e.g., that `fly' in the sentence `You'll have
to fly' is a verb rather than a noun. Other processors can do much more: e.g.,
they realise that `you' will be doing the flying, and may be able to decide
whether `fly' is meant literally or idiomatically (in context). But such
`deep' systems are much slower at processing text and far more complex to build
than the simpler `shallow' systems. Therefore researchers in natural language
processing try to combine multiple systems in different ways, in particular so
that deep systems are only used on text that is identified as interesting by a
shallow system. However, progress has been hindered by the lack of a common
interface between systems.
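The shallow-then-deep arrangement described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `shallow_is_interesting`, `deep_analyse`, `cascade` and the trigger words are hypothetical stand-ins, not components of any actual system, and the "deep" step is a placeholder that merely tags its input.

```python
from typing import List, Optional, Tuple

# Hypothetical trigger words; a real shallow processor would be far richer.
TRIGGER_WORDS = frozenset({"synthesis", "synthesised"})


def shallow_is_interesting(sentence: str) -> bool:
    """Cheap check: does the sentence contain any trigger word?"""
    tokens = {tok.strip(".,;:()").lower() for tok in sentence.split()}
    return bool(tokens & TRIGGER_WORDS)


def deep_analyse(sentence: str) -> str:
    """Placeholder for an expensive deep processor; it only tags the input."""
    return f"DEEP({sentence})"


def cascade(sentences: List[str]) -> List[Tuple[str, Optional[str]]]:
    """Run the deep step only on sentences flagged by the shallow step."""
    results = []
    for sentence in sentences:
        if shallow_is_interesting(sentence):
            results.append((sentence, deep_analyse(sentence)))
        else:
            # Skipped by the deep processor: no deep analysis is produced.
            results.append((sentence, None))
    return results
```

The point of the sketch is the control flow: the costly function is called only where the cheap one says it is worthwhile, so the overall cost is dominated by the shallow pass on uninteresting text.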
We are developing a formal language which captures some aspects of the
meaning of natural language in a way that allows contributions from different
processors to be combined. The combined systems can be used to extract
knowledge from text for later machine use, or to give human browsers
information about the structure of texts and their interconnections. In this
project, we will use this approach to analyse research papers in Chemistry, so
that aspects of their meaning can be extracted and used in the Semantic Web.
For example, we can obtain information about how particular compounds are
synthesised and represent this so that researchers can look up the information
more easily. We are also trying to automatically discover information about
the meaning of terms used in Chemistry. For instance, our system might
discover that `an alkaloid is a type of azacycle' from the phrase `the concise
synthesis of naturally occurring alkaloids and other complex polycyclic
azacycles'. We will also analyse text structure so that we can tell whether
an author is agreeing with a previous publication or criticising it.
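The alkaloid example above follows a familiar lexical pattern, "X and other Y", from which one can guess that X is a kind of Y. The following is a minimal sketch of that idea, not the project's method: the regular expression, the function names `singular` and `hyponym_pairs`, and the naive plural stripping are all assumptions made for illustration. It also assumes the input is a short phrase that ends with the noun phrase, so that the last word can be taken as the head noun.

```python
import re
from typing import List, Tuple

# "X and other <modifiers> Y": capture X and the whole following word sequence.
PATTERN = re.compile(r"(\w+)\s+and\s+other\s+((?:\w+\s+)*\w+)")


def singular(noun: str) -> str:
    """Naive singularisation (assumption): drop one trailing 's'."""
    if noun.endswith("s") and not noun.endswith("ss"):
        return noun[:-1]
    return noun


def hyponym_pairs(phrase: str) -> List[Tuple[str, str]]:
    """Return (hyponym, hypernym) guesses from 'X and other Y' matches."""
    pairs = []
    for match in PATTERN.finditer(phrase):
        hyponym = match.group(1)
        # Treat the last word of the captured sequence as the head noun.
        head = match.group(2).split()[-1]
        pairs.append((singular(hyponym), singular(head)))
    return pairs
```

Applied to the phrase quoted above, `hyponym_pairs` returns `[("alkaloid", "azacycle")]`. A pattern this simple will misfire on running text, where the head noun is not the last word of the sentence; deciding where the noun phrase ends is exactly the kind of information a parser would have to supply.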
These tools will be combined in a complete system for use by working chemists
who will give us feedback on the results. We are collaborating with major
publishers who are allowing us to experiment with papers in their collections.
We expect to use a GRID of parallel computers to process tens of thousands of
papers in order to build a substantial knowledge base. At the end of the
project, we will investigate the extension of this work to other sciences.
More generally, the approach will have wide application to the extraction of
information from many types of text.
Objectives
To develop a natural-language oriented markup language which enables the
tight integration of partial information from a wide variety of language
processing tools, while being compatible with GRID and Web protocols and having
a sound logical basis consistent with Semantic Web standards.
To use this language as a basis for robust and extensible
extraction of information from scientific texts.
To model scientific argumentation and citation purpose in order
to support novel modes of information access.
To demonstrate the applicability of this infrastructure in a real-world
e-Science environment by developing technology for Information Extraction and
ontology construction applied to Chemistry texts.