SciBorg: Extracting the Science from Science Publications

Overview

SciBorg is a four-year project, starting on 1 October 2005, funded by the EPSRC under the programme for Computer Science for e-Science. The project is a collaboration between three groups at the University of Cambridge.

We are cooperating with three major publishers.

The project summary and objectives are below. For further information, please see the detailed project description, which was based on the project proposal, and the developing SciBorg project wiki pages.

Summary

Many tools exist for processing natural languages, such as English, but there is no single perfect system. Different approaches have different strengths and weaknesses. For instance, some very fast processors are designed to make decisions about part of speech: e.g., that 'fly' in the sentence 'You'll have to fly' is a verb rather than a noun. Other processors can do much more: e.g., they realise that 'you' will be doing the flying, and may be able to decide whether 'fly' is meant literally or idiomatically (in context). But such 'deep' systems are much slower at processing text and far more complex to build than the simpler 'shallow' systems. Therefore, researchers in natural language processing try to combine multiple systems in different ways, in particular so that deep systems are only used on text that a shallow system has identified as interesting. However, progress has been hindered by the lack of a common interface between systems.
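
The combination described above, in which a shallow system selects the text that a deep system then analyses, can be sketched as follows. This is only an illustration under stated assumptions: the keyword filter and the token-counting "deep" analyser are hypothetical placeholders, not SciBorg components.

```python
from typing import Callable, Dict, List

# Hypothetical stand-ins for illustration only; neither is a real SciBorg tool.


def shallow_is_interesting(sentence: str, keywords: List[str]) -> bool:
    """Cheap filter: flag a sentence if it contains any keyword substring."""
    lowered = sentence.lower()
    return any(keyword in lowered for keyword in keywords)


def deep_analyse(sentence: str) -> Dict[str, object]:
    """Placeholder for an expensive deep analyser; here it only counts tokens."""
    return {"sentence": sentence, "tokens": len(sentence.split())}


def cascade(
    sentences: List[str],
    keywords: List[str],
    deep: Callable[[str], Dict[str, object]] = deep_analyse,
) -> List[Dict[str, object]]:
    """Run the deep analyser only on sentences the shallow filter selects."""
    return [deep(s) for s in sentences if shallow_is_interesting(s, keywords)]


sentences = [
    "You'll have to fly.",
    "The compound was synthesised in three steps.",
    "We thank the reviewers.",
]
results = cascade(sentences, keywords=["synthes"])
```

In this sketch only the second sentence reaches the deep step, so the expensive analysis runs once rather than three times.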

We are developing a formal language which captures some aspects of the meaning of natural language in a way that allows contributions from different processors to be combined. The combined systems can be used to extract knowledge from text for later machine use, or to give human browsers information about the structure of texts and their interconnections. In this project, we will use this approach to analyse research papers in Chemistry, so that aspects of their meaning can be extracted and used in the Semantic Web. For example, we can obtain information about how particular compounds are synthesised and represent this so that researchers can look up the information more easily. We are also trying to automatically discover information about the meaning of terms used in Chemistry. For instance, our system might discover that 'an alkaloid is a type of azacycle' from the phrase 'the concise synthesis of naturally occurring alkaloids and other complex polycyclic azacycles'. We will also analyse text structure so that we can tell whether an author is agreeing with a previous publication or criticising it.
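
As a rough illustration of the alkaloid example above, the "X and other Y" construction can be matched with a simple lexical pattern. This is a minimal sketch, not the project's method: it assumes the phrase ends where the second noun phrase ends, takes the last word of that phrase as its head, and strips a trailing "s" to form a singular. The names `AND_OTHER`, `naive_singular`, and `extract_is_a` are hypothetical.

```python
import re
from typing import List, Tuple

# Matches "<word> and other <words>", where the second noun phrase is assumed
# to run up to the next punctuation mark or the end of the string.
AND_OTHER = re.compile(
    r"(\w+)\s+and\s+other\s+((?:\w+\s+)*?\w+)(?=\s*(?:[.,;:]|$))"
)


def naive_singular(word: str) -> str:
    """Very rough singularisation: drop one trailing 's' (an assumption)."""
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def extract_is_a(phrase: str) -> List[Tuple[str, str]]:
    """Return (narrower term, broader term) pairs found in the phrase."""
    pairs = []
    for match in AND_OTHER.finditer(phrase):
        narrower = naive_singular(match.group(1).lower())
        # Treat the last word of the second noun phrase as its head noun.
        broader = naive_singular(match.group(2).split()[-1].lower())
        pairs.append((narrower, broader))
    return pairs


phrase = (
    "the concise synthesis of naturally occurring alkaloids "
    "and other complex polycyclic azacycles"
)
pairs = extract_is_a(phrase)
```

A pattern this simple will misfire on many real sentences, which is one reason to combine it with evidence from deeper analysis rather than rely on it alone.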

These tools will be combined in a complete system for use by working chemists, who will give us feedback on the results. We are collaborating with major publishers who are allowing us to experiment with papers in their collections. We expect to use a GRID of parallel computers to process tens of thousands of papers in order to build a substantial knowledge base. At the end of the project, we will investigate extending this work to other sciences. More broadly, the general approach will have wide application to the extraction of information from many types of text.

Objectives

  1. To develop a natural-language-oriented markup language which enables the tight integration of partial information from a wide variety of language processing tools, while being compatible with GRID and Web protocols and having a sound logical basis consistent with Semantic Web standards.

  2. To use this language as a basis for robust and extensible extraction of information from scientific texts.

  3. To model scientific argumentation and citation purpose in order to support novel modes of information access.

  4. To demonstrate the applicability of this infrastructure in a real-world e-Science environment by developing technology for Information Extraction and ontology construction applied to Chemistry texts.
