What is the future of the Internet? We see it as a new kind of distributed database, a single huge document, a tree of strings, divided and replicated across thousands of servers. What security mechanisms will encourage users to share control over data more widely? How far can we detach the question of who controls the servers from who controls the data?

Project Dendros: Joint administration of structured data

Project Dendros creates a unifying data-management layer for Internet applications. Its core is a flexible and simple data model, essentially a tree of byte strings. This is combined with a modular processing model based on access filters, which implement functions such as storage management, remote read/write access, authentication, access control, revision control, and replication. A particular aim is the design of new cooperative administration concepts that enable participants with limited mutual trust to cooperate effectively in a single global name space.

Introduction

Dendros is a middleware layer that sits between a reliable transport mechanism – such as TCP/IP – and distributed applications. It differs substantially from procedural middleware systems (RPC, CORBA, Web Services) in that it stores the data itself, instead of just passing on method invocations to other databases. Dendros defines a modular set of methods for reading, editing and managing a standardized distributed data structure. It has more in common with a distributed file system, database, directory, or some peer-to-peer systems than with a remote procedure call facility. Applications cooperate via Dendros by loading their data into its globally distributed tree of strings. They then use the Dendros facilities to access, update, search, protect, replicate and version control this data.

The ultimate aim of Project Dendros is to create, study and optimize a unifying replacement technology for historically grown standards and technologies such as:

the Internet domain name system
traditional Unix-style file systems
hierarchical message and file structuring formats such as SGML/XML, MIME, ASN.1, RFC822 email headers, RFC2425 directory entries
directory systems such as X.500, LDAP
revision control systems such as CVS

We have identified two key research problems on which we need to work before our vision can be fully realized, and they are our focus for this year.

Generalized representation of hierarchical data

The first focus is the selection and definition of a flexible generic hierarchical data syntax and processing model that is

well-suited to break down the barrier between
- network name spaces (e.g., DNS domain names, X.500 distinguished names),
- file systems with flat files (e.g., POSIX file systems),
- file-internal structures (XML, ASN.1, MIME, and many more),
very easy to access, view, edit, and understand,
well-suited for use in revision-controlled storage systems,
well-suited for representing the structurally rich information that occurs in many administrative, scientific and publishing tasks and that maps badly to the relational database model,
defined strictly as a very simple abstract data model that can then be mapped with round-trip compatibility onto a range of encodings (including some well-established ones such as XML),
well-suited for efficient handling of large binary objects and lists.

No existing standard seems to be a good candidate. For instance, ASN.1 and SGML are early 1980s attempts to define generic models for structured file and packet formats. They found their way into some popular applications (X.509, HTML) but turned out to be rather inconvenient to use in practice. XML was later defined as a small subset of SGML’s syntax that can be parsed unambiguously without an externally supplied grammar. It is a significant improvement that generated enormous industry interest in generic hierarchical data models. However, having inherited SGML’s strong orientation towards being a text-document markup language, it still fails on many of the above requirements.

We have now found a suitable new data model, which we call the Compound Data Structure. Some of its key features are:

Data is structured into a tree of byte strings. This reflects the hierarchy found in most name spaces, file systems, data file headers, XML files, text documents, etc.
Nodes in this tree consist of
- a small integer number (“tag”)
- an arbitrary-length byte string
- a set of child trees
- a list of child trees
The set is implemented such that it can also be used as a mapping, that is, each set element can be used as a key with a second subtree associated as its value.
By distinguishing between set elements and list elements among the child nodes, our data model makes explicit whether the relative order of child nodes matters or not. Incorporating this information at the lowest level in the data model greatly helps in automatically merging concurrent updates.
(File names in a subdirectory or attributes in XML are examples of set elements, because they need to be unique, but the order in which they are added does not matter. Elements of a text document or priority list are examples of list elements, where the relative position is an integral part of the information stored.)
The tag value differentiates between a very small number of different string flavours. This enables binary transparency, backwards-compatible extension of both application and middleware command formats, and in-band error signaling. The tag eliminates the need for escape bytes in our data model, and all the complications and security risks they would bring.
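As an illustration only, the node structure described in the list above might be sketched in Python as follows. The class name Compound, the two tag constants and the put/get helpers are hypothetical names chosen for this sketch, not part of any Dendros interface. As a simplification, set elements are identified here by their tag and byte string alone, without regard to their own subtrees.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# Hypothetical tag values (the text only says there are very few of them).
TAG_TEXT = 0
TAG_NAME = 1


@dataclass
class Compound:
    tag: int                     # small integer "flavour" of the string
    value: bytes = b""           # arbitrary-length byte string
    # Set of child trees, usable as a mapping: key node -> optional value subtree.
    # Simplification: set elements are identified by (tag, value) only.
    members: Dict[Tuple[int, bytes], Tuple["Compound", Optional["Compound"]]] = field(
        default_factory=dict
    )
    # List of child trees, where relative order is part of the information.
    items: List["Compound"] = field(default_factory=list)

    def put(self, key: "Compound", mapped: Optional["Compound"] = None) -> None:
        """Add a set element; adding the same key again replaces its value."""
        self.members[(key.tag, key.value)] = (key, mapped)

    def get(self, tag: int, value: bytes) -> Optional["Compound"]:
        """Return the subtree mapped to a set element, or None."""
        entry = self.members.get((tag, value))
        return entry[1] if entry else None


root = Compound(TAG_NAME, b"project")
# Set elements: unique keys, insertion order carries no meaning.
root.put(Compound(TAG_NAME, b"README"), Compound(TAG_TEXT, b"hello"))
root.put(Compound(TAG_NAME, b"README"), Compound(TAG_TEXT, b"hello again"))
# List elements: position is part of the stored information.
root.items.append(Compound(TAG_TEXT, b"first paragraph"))
root.items.append(Compound(TAG_TEXT, b"second paragraph"))
```

Because the byte string is carried next to a tag rather than marked up in-band, the value field can hold arbitrary binary data without any escaping.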

For example, XML maps very naturally onto the compound model. Element names, attribute names, attribute values and “PCDATA” text are all represented as strings. XML attribute names need to be unique to an element and their relative order does not matter, therefore attribute names are stored in the set of child nodes of the corresponding element string and map to the corresponding value. XML child elements and text content, on the other hand, are stored in the list of child nodes. The tag distinguishes between element names and text content.
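The mapping just described can be sketched with Python's standard xml.etree.ElementTree parser. This is a minimal illustration under stated assumptions: the Node class, the tag constants TAG_TEXT and TAG_ELEMENT, and the function from_xml are hypothetical, and for brevity the set part is keyed by the raw attribute-name bytes rather than by full nodes.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field

TAG_TEXT, TAG_ELEMENT = 0, 1   # hypothetical tag values


@dataclass
class Node:
    tag: int
    value: bytes
    attrs: dict = field(default_factory=dict)      # set part: name -> value node
    children: list = field(default_factory=list)   # list part: ordered content


def from_xml(elem):
    """Convert an ElementTree element into a compound-like Node."""
    node = Node(TAG_ELEMENT, elem.tag.encode())
    # Attributes are unique and unordered, so they go into the set part.
    for name, val in elem.attrib.items():
        node.attrs[name.encode()] = Node(TAG_TEXT, val.encode())
    # Text and child elements are ordered, so they go into the list part.
    if elem.text:
        node.children.append(Node(TAG_TEXT, elem.text.encode()))
    for child in elem:
        node.children.append(from_xml(child))
        if child.tail:
            node.children.append(Node(TAG_TEXT, child.tail.encode()))
    return node


doc = from_xml(ET.fromstring('<p lang="en" id="x1">Hello <b>world</b>!</p>'))
```

Here the tag is the only thing that tells the element name b"b" apart from the text strings around it in the list of children.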

In XML terms, compounds can be understood as a generalization and simplification that removes a number of XML’s restrictions, without adding significantly to the complexity of the implementation. Compounds are in a sense a variant of XML in which attribute names and values can be entire XML documents again, and not just flat strings. In addition, every single node in the compound tree, not just elements as in XML, can be annotated with attributes.

File systems map just as easily to compounds. The mapping provided by the set of child nodes corresponds to a subdirectory. For the user of a compound storage system, there is no difference between XML attributes and file system subdirectories. Both are represented by the same set construct. Compounds can be seen as a generalization of file systems, where a node can be a subdirectory and a file at the same time (a useful notion already implemented by HTTP servers via index.html), and where a file is either a flat byte string or an XML-like tree structure.
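A directory tree can be loaded into the same kind of structure. The following sketch is an assumption-laden illustration: Node, from_path and the index_name parameter are hypothetical, and folding an index file into the directory node is just one possible way to show a node acting as a subdirectory and a file at the same time.

```python
import os
import tempfile
from dataclasses import dataclass, field


@dataclass
class Node:
    value: bytes = b""                            # file content, if any
    entries: dict = field(default_factory=dict)   # set part: name -> subtree


def from_path(path, index_name=None):
    """Map a file or directory onto a Node; directories use the set part."""
    if not os.path.isdir(path):
        with open(path, "rb") as f:
            return Node(value=f.read())
    node = Node()
    for name in sorted(os.listdir(path)):
        child = os.path.join(path, name)
        if name == index_name and os.path.isfile(child):
            # The directory node itself carries the index file's content.
            with open(child, "rb") as f:
                node.value = f.read()
        else:
            node.entries[name.encode()] = from_path(child, index_name)
    return node


with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, "docs"))
    with open(os.path.join(root, "docs", "index.html"), "wb") as f:
        f.write(b"<h1>Docs</h1>")
    with open(os.path.join(root, "README"), "wb") as f:
        f.write(b"hello")
    tree = from_path(root, index_name="index.html")
```

After loading, the docs node has both a byte-string value and a (here empty) set of entries, which a conventional file system cannot express directly.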

The very natural and convenient mappings from file systems and many structured file formats to compounds make them an ideal choice as the underlying data model of a distributed storage architecture.

The abstract data model just outlined is merely the foundation of our work. It is then combined with a range of ways for encoding compounds as byte strings, optimized for different needs (human reading, use with plaintext editors, efficient access, compactness, sorting and hashing, etc.).

We are also working on a processing model that we believe will be considerably more convenient for application developers than existing practice with XML. The key idea is that our compound API will provide a mechanism that enables applications to activate and deactivate various “filters”. These are software layers that provide transformed views of the accessed compound. Each filter interacts with its next higher and lower layer by exchanging compound read/write accesses. Filters add some value for the user, usually by providing an additional layer of abstraction.

A simple example of a filter is one that transparently detects strings containing compressed compounds and decompresses and unpacks their content on the fly, as in a compressed file system. Another is one that transparently dereferences symbolic links. Filters can be activated either by attribute strings with processing-instruction tags found in the stored compound, or by processing instructions embedded in the access path of a request. Remote access to other servers is likewise handled by filters. With processing instructions embedded in query paths, the application has full control over which filters are applied on a remote server and which in the local compound-access library. Filters are a kind of driver architecture for compound-access libraries. They embody much of the functional richness of a compound storage system in a very modular way, giving users great control over the kind and location of processing steps applied to their data.

Filters use the same simple compound access interface to communicate with the next layer below that they provide to the layer immediately above. They can therefore be stacked and reordered arbitrarily. The bottom layer below all filters can be a comparatively simple compound storage server, whereas the layer above all filters is the application.
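The stacking property can be illustrated with a small Python sketch. Everything here is hypothetical: the read/write interface, the class names Store, Filter, CompressFilter and AccessLogFilter, and the use of path tuples are assumptions made for this example. Unlike the filter described above, this compression filter compresses every value rather than detecting compressed strings.

```python
import zlib


class Store:
    """Bottom layer: a trivial in-memory store from path tuples to byte strings."""

    def __init__(self):
        self.data = {}

    def read(self, path):
        return self.data[path]

    def write(self, path, value):
        self.data[path] = value


class Filter:
    """A filter offers upward the same interface it uses on the layer below."""

    def __init__(self, below):
        self.below = below

    def read(self, path):
        return self.below.read(path)

    def write(self, path, value):
        self.below.write(path, value)


class CompressFilter(Filter):
    """Stores values compressed below and presents them uncompressed above."""

    def read(self, path):
        return zlib.decompress(self.below.read(path))

    def write(self, path, value):
        self.below.write(path, zlib.compress(value))


class AccessLogFilter(Filter):
    """Records every access that passes through this layer."""

    def __init__(self, below):
        super().__init__(below)
        self.log = []

    def read(self, path):
        self.log.append(("read", path))
        return self.below.read(path)

    def write(self, path, value):
        self.log.append(("write", path))
        self.below.write(path, value)


store = Store()
stack = AccessLogFilter(CompressFilter(store))   # application sits above 'stack'
stack.write((b"docs", b"readme"), b"x" * 1000)
result = stack.read((b"docs", b"readme"))
```

Since every layer exposes the same two methods, the filters could equally be stacked as CompressFilter(AccessLogFilter(store)) without changing any of the classes.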

We believe that compounds on their own are a promising successor for XML. We hope that they will not remain restricted to the distributed data management system designed in this project, but will find their way into many other next-generation Internet technologies.

Revision-controlled symmetric replication among untrusted servers

The second focus of Project Dendros is to provide a distributed storage system that is designed to support the joint administration and long-term maintenance of data collections that are not “owned” by any single person or organization. We are particularly interested in the storage management needs of large, heterogeneous and potentially mutually suspicious communities. For example, many open-source projects or international standardization committees work towards improving a set of files or documents, but many of the members are far from trusting each other completely. In such environments, the object managed by the shared storage system is not just a set of files, but the revision history and audit log that led to their current state.

Existing practice in group storage management usually involves a single central server that defines what the authoritative current snapshot and revision history of a project’s files are. In commonly used repository management software (e.g., CVS, sourceforge.net), compromising the central server is sufficient to make modifications to the archive without automatically making every client in the group aware of the change. Furthermore, compromised central servers could generate significant confusion about the current state of the project by maintaining different conflicting revision histories for different group members. These issues are of particular concern with the development and maintenance of security-critical software that is used to build trusted computing bases of important infrastructure applications.

Potential contributors may also be reluctant to participate in projects where they will not have full control over the integrity of the storage server, and handing over control of the authoritative server from one person or organization to the next can become a major “political” decision.

We therefore aim for a storage architecture with the following properties:

Instead of a single authoritative central server, the data is managed by a group of equal peer servers.
Each of these servers independently verifies whether all access conditions specified are fulfilled for each change to the shared data and maintains an independent audit log.
Group members broadcast change requests to the group of peer servers.
Group members participate in a periodic check-point protocol that either confirms agreement within the peer group on a common revision history and audit log or that flags and records discrepancies.
Malicious behaviour of a subgroup of the servers will not interfere with the integrity and availability of the data on the correct peers.
Misbehaving servers can easily be identified from audit logs.
Groups can be configured to automatically isolate misbehaving servers.
Adding peers to the group and removing peers from it is a relatively simple operation that does not affect the integrity and accountability guarantees that the other servers provide to the group.
New change requests can be incorporated in stages, following configurable review requirements. The right to add a change to the revision history of a collection and the right to mark a new version as the new “current” version are orthogonal. This makes it easier for the community to grant write access to less trusted members. Dendros provides in a sense a “Trusted CVS”.
The rules for how stored data and its revision history can be modified are laid down in the “digital constitution” of a project, which identifies roles, rights and authorities via public keys and specifies the rules according to which these can be delegated and handed over to successors. Each server in the peer group independently verifies submitted update requests against the digital constitution, providing for its distributed enforcement.
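One way to picture independent audit logs and a check-point comparison is a hash chain per peer, sketched below in Python. This is not the Dendros protocol, which the text does not specify: the Peer class, the chain and checkpoint functions, and the simple majority comparison are assumptions for illustration. A real protocol would also need signatures and an agreement mechanism that tolerates malicious peers.

```python
import hashlib
from collections import Counter


def chain(prev_digest: bytes, change: bytes) -> bytes:
    """Extend a hash chain so the digest commits to the whole history."""
    return hashlib.sha256(prev_digest + hashlib.sha256(change).digest()).digest()


class Peer:
    """A peer server keeping its own audit log and running chain digest."""

    def __init__(self, name):
        self.name = name
        self.head = b"\x00" * 32   # digest of the empty history
        self.log = []

    def apply(self, change: bytes) -> None:
        self.log.append(change)
        self.head = chain(self.head, change)


def checkpoint(peers):
    """Compare chain heads; return the most common head and the dissenting peers."""
    counts = Counter(p.head for p in peers)
    majority, _ = counts.most_common(1)[0]
    dissenters = sorted(p.name for p in peers if p.head != majority)
    return majority, dissenters


a, b, c = Peer("a"), Peer("b"), Peer("c")
for change in (b"add file README", b"edit README line 3"):
    a.apply(change)
    b.apply(change)
c.apply(b"add file README")
c.apply(b"edit README line 4")   # peer c records a different history

agreed, suspects = checkpoint([a, b, c])
```

Because each digest depends on all earlier entries, two peers that ever recorded different changes will disagree at every later check-point, and the logs show where the histories diverged.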

The Dendros replication mechanism aims to allow anyone who might mistrust the operators of other servers to join the peer group with their own server, so as to provide a local guarantee for the availability of the data and the integrity of its audit information. We hope that this technology will help to encourage cooperation of heterogeneous groups over the Internet and reduce or delay the splitting of collaborative projects into separate storage archives. We aim to replace the need for handing over the effective authority over a storage archive from one individual to the next with a more gradual change of group membership, to help increase the continuity and lifetime of heterogeneous collaborations.

Some of the protocols needed to achieve these goals cause the data traffic between peers to grow quadratically with the number of participating servers. We therefore expect that Dendros will typically be used with no more than 20–30 servers in the “writer” peer group that accepts and agrees on update requests and moves the revision history forward. A larger set of “reader” servers (possibly many thousands) can nevertheless passively copy the updates in the order agreed by the “writers” and independently verify the audit information included. The “readers” independently ensure the availability of the data and its past versions for read access, but not for updates. We believe that a set of a few dozen independently run servers through which all updates need to pass is sufficient to ensure the long-term continuity of projects. Given the benefits of running strong protocols designed to handle the risk of malicious servers, this approach seems to us preferable to the fully symmetric networks suggested in many “peer-to-peer” projects.
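To give a feeling for the quadratic growth, the following lines count messages under the assumption that, in one protocol round, every writer sends one message to every other writer. The text does not say which exchange pattern the protocols use, so the function messages_per_round is purely illustrative.

```python
def messages_per_round(n: int) -> int:
    """Messages when each of n servers sends one message to every other server."""
    return n * (n - 1)


for n in (5, 20, 30):
    print(n, messages_per_round(n))
# prints:
# 5 20
# 20 380
# 30 870
```

Under this assumption a group of 30 writers exchanges fewer than a thousand messages per round, whereas a group of 1000 would need 999000, which is why large numbers of servers are better placed in the passive “reader” role.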

With regard to practical application needs and demonstration examples, we are motivated in particular by the archival and file-management needs of long-term projects, with an initial focus on collaborative software development, international standardization, and scientific publication.

Documents

For more information on Project Dendros, please come back from time to time to our growing collection of working documents and papers:

Markus G. Kuhn, Steven J. Murdoch, Piotr Zieliński: Compounds: a next-generation hierarchical data model. Poster, first presented at Microsoft Research Academic Days, Dublin, April 2004.
Related literature:
- Steven Murdoch’s “Survey of general-purpose data-representation formats and markup languages” reviews existing proposals for generic file-formats and proposals for XML variants.
- Markus Kuhn’s “Probabilistic counting of large digital signature collections” is an early result from our investigations into what new mechanisms may be needed to implement “digital constitutions” in scalable distributed storage systems efficiently.

Software

We have implemented a proof-of-concept prototype library in Perl and have a first production application that benefits substantially from it (the ucampas web content-management system used for our main departmental web pages).

Markus G. Kuhn