FRESCO

The Fabric For Reproducible Computing (FRESCO) project in the Digital Technology Group at the University of Cambridge Computer Laboratory researches the collection and use of data and computational provenance to support Lineage and Trust in big data.

In particular, the group is pursuing three research directions:

Observed Provenance in User Space (OPUS):

We have built an always-on computation and data provenance collection system to support scientific computation. This system transparently records, at a fine-grained level and with less than 10% performance degradation, all function calls made by a process. It achieves this without requiring application source code or any changes to the runtime environment. The collected data may be used to answer queries that support lineage and trust, e.g. a user can ask “what was the sequence of steps that resulted in the creation of this graph?”. Furthermore, the system enables users to package code and data in a single atomic and executable unit that can be given to others, thereby supporting computational reproducibility.
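As an illustration of the kind of lineage query described above, the following is a minimal sketch and not the OPUS implementation or its API. It assumes a hypothetical, already-collected log in which each recorded step lists the files it read and wrote; the `Step` type, the `steps_leading_to` function and the file names are all invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One recorded action (hypothetical format): a command with its reads and writes."""
    order: int
    command: str
    reads: tuple
    writes: tuple


def steps_leading_to(log, target):
    """Return the recorded steps that contributed to `target`, oldest first."""
    wanted = {target}
    relevant = []
    # Walk the log from newest to oldest so that a step is kept only if it
    # wrote something that a later relevant step (or the target) depends on.
    for step in sorted(log, key=lambda s: s.order, reverse=True):
        if wanted & set(step.writes):
            relevant.append(step)
            wanted |= set(step.reads)
    return list(reversed(relevant))


# Illustrative log: step 3 is unrelated to the graph and should be skipped.
log = [
    Step(1, "fetch raw.csv", (), ("raw.csv",)),
    Step(2, "clean.py", ("raw.csv",), ("clean.csv",)),
    Step(3, "unrelated.py", ("notes.txt",), ("summary.txt",)),
    Step(4, "plot.py", ("clean.csv",), ("graph.png",)),
]

for step in steps_leading_to(log, "graph.png"):
    print(step.order, step.command)
```

In this sketch the query for graph.png returns steps 1, 2 and 4, which is one possible answer to the question of which steps resulted in the graph.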

This has resulted in the following publications:

Documents

Provenance For MapReduce

The group has built a fine-grained (at key-value level) lineage collection and storage system for MapReduce that imposes time and space overheads of less than 10% in the common case. Users use this system to carry out fine-grained causality analysis, to debug MapReduce jobs, and to perform fine-grained auditing and information flow analysis of MapReduce data.
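To make key-value level lineage concrete, here is a small sketch under stated assumptions. It is not the group's system: it runs a toy in-memory word count rather than a real MapReduce framework, and the names `map_words`, `run_job` and the record identifiers are hypothetical. Each output key keeps the set of input records that contributed to it.

```python
from collections import defaultdict


def map_words(record_id, line):
    """Map phase: emit (word, 1) pairs, each tagged with its source record."""
    for word in line.split():
        yield word, 1, record_id


def run_job(records):
    """Run a toy word count and record, per output key, which inputs fed it."""
    grouped = defaultdict(list)   # key -> values, as a shuffle would group them
    lineage = defaultdict(set)    # key -> ids of the input records that produced it
    for record_id, line in records.items():
        for key, value, source in map_words(record_id, line):
            grouped[key].append(value)
            lineage[key].add(source)
    # Reduce phase: sum the values for each key.
    counts = {key: sum(values) for key, values in grouped.items()}
    return counts, dict(lineage)


records = {"r1": "a b a", "r2": "b c", "r3": "c c"}
counts, lineage = run_job(records)
print(counts["c"], sorted(lineage["c"]))
```

With lineage stored this way, a surprising output value can be traced back to exactly the input records that produced it, which is the basis for the debugging and auditing uses mentioned above.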

This has resulted in the following publications:

Resourceful

The group has built a fine-grained kernel-level resource accounting system that supports computational provenance by enabling users to correlate synchronous and asynchronous Operating System kernel resource usage (CPU, memory, network and disk I/O) with applications at the system-call level. In doing so, the system provides infrastructure to support fine-grained resource charging mechanisms, the identification of unexpected or unwarranted kernel-level resource consumption, and the ability to model and predict the global side-effects of process-local actions.
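The accounting idea can be sketched as follows. This is an illustration only and does not reflect how Resourceful is implemented: it assumes hypothetical usage records that already name the application and the system call responsible, including work the kernel performed asynchronously after the call returned. The `Usage` type, the `account` function and all numbers are invented.

```python
from collections import defaultdict
from typing import NamedTuple


class Usage(NamedTuple):
    """One hypothetical accounting record attributed to a system call."""
    app: str
    syscall: str
    resource: str    # e.g. "cpu_us" or "disk_bytes"
    amount: int
    deferred: bool   # True if the kernel did the work after the call returned


def account(events):
    """Total resource usage per (application, system call, resource)."""
    totals = defaultdict(int)
    for event in events:
        totals[(event.app, event.syscall, event.resource)] += event.amount
    return dict(totals)


events = [
    Usage("editor", "write", "cpu_us", 40, deferred=False),
    # Disk traffic from a later writeback, still charged to the original write.
    Usage("editor", "write", "disk_bytes", 4096, deferred=True),
    Usage("editor", "write", "cpu_us", 15, deferred=True),
    Usage("backup", "read", "disk_bytes", 8192, deferred=False),
]

totals = account(events)
print(totals[("editor", "write", "cpu_us")])
```

Charging the deferred records to the call that caused them is what allows both per-application billing and the detection of unexpected kernel-level consumption in this simplified model.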

This has resulted in the following publication:

It is our goal to distil the lessons learnt from building and using these systems into a set of general-purpose principles that will be incorporated into future systems. It is our vision to promote the collection and use of provenance for supporting lineage and trust in big data to a first-class principle of future computation systems.

Focus on Big Data

In the last half-decade there has been a growing awareness of the social, scientific and commercial value of big data. Big-data-based techniques have revolutionised existing disciplines such as genomics, augmented existing commercial spheres such as web analytics and made possible new domains such as behavioural analysis.

It is clear that the adoption of big data in both the academic and commercial realms is of benefit. However, this paradigm encompasses two major problems that are big enough to limit its longer-term value if left unaddressed:

Lineage

Big data computations will draw on a myriad of input sources. As computation chains become longer, the number of input sources will increase. Similarly, as the complexity and interdisciplinary nature of computations increases, inputs are likely to increase in diversity. It is imperative that consumers of derived big data outputs can identify all data sources used to produce a result, both to establish precisely and accurately which inputs were used in the computation and to attest to non-technical properties of the dataset, such as the legality of using its input sources.

Trust

As people use datasets as primary evidence for scientific, social and commercial decisions, it is important that they are able to validate and verify the computation chain leading to the output. In particular, it is important that users are able to reproduce computations and identify where and why results diverge from the original output, in order to establish confidence that the computation is replicable and thus trusted.
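One simple way to locate where a reproduced computation diverges is to compare a digest of each step's output between the original run and the rerun. The sketch below assumes each run is available as an ordered list of (step name, SHA-256 digest) pairs; this format, the `first_divergence` function and the step names are hypothetical and are not taken from the FRESCO systems.

```python
import hashlib


def digest(data: bytes) -> str:
    """Hex SHA-256 digest of one step's output."""
    return hashlib.sha256(data).hexdigest()


def first_divergence(original, rerun):
    """Return the name of the first step whose record differs, or None."""
    for index, record in enumerate(original):
        # A missing step or a different (name, digest) pair counts as divergence.
        if index >= len(rerun) or rerun[index] != record:
            return record[0]
    if len(rerun) > len(original):
        # The rerun performed extra steps the original never had.
        return rerun[len(original)][0]
    return None


original = [
    ("load", digest(b"1,2,3")),
    ("normalise", digest(b"0.1,0.2,0.3")),
    ("plot", digest(b"png-a")),
]
rerun = [
    ("load", digest(b"1,2,3")),
    ("normalise", digest(b"0.10,0.20,0.30")),
    ("plot", digest(b"png-b")),
]

print(first_divergence(original, rerun))
```

Here the runs agree on the load step and first differ at normalise, so that is where an investigation into why the results diverge would start. Byte-level digests are a deliberately strict choice; a real comparison might need to tolerate benign differences such as timestamps.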

Other Work

The team has also produced a small number of other pieces of work:

A general primer on provenance and the state of the art:

An improved provenance API