
Some terminology related to the PlexTree architecture
-----------------------------------------------------


Dendros -- This is the name of a research project led by Markus Kuhn
at the University of Cambridge Computer Laboratory, to design and
explore a next-generation software architecture centered around a
hierarchical data model for semi-structured data with associated
processing model.

PlexTree -- This is the name of a software architecture that emerged
from the Dendros project. It is centered around the "compound" as a
general-purpose hierarchical data model for semi-structured data, and
the "filter" as the primary mechanism for implementing reusable
processing functionality for these compounds.

perl-PlexTree -- This is the initial prototype and reference
implementation PlexTree library in Perl 5.8.

compound -- A compound is a recursively defined data structure, a kind
of tree of byte strings. It can be used as a hierarchical general
purpose data structure, particularly suitable as a framework for
defining file formats and database schemata. Each compound consists of
a set of nodes, of which one represents the root of the compound. In
addition, each compound also contains a set of "keys", which are
themselves complete compounds (usually very small ones). The remaining
nodes can be classified into "list elements" and "set elements". Each
node N has associated with it a (possibly empty) list of nodes ("list
elements"), and all nodes in that list have N as their designated
parent node. Each node N further has associated with it a set of
key-value pairs ("directory"), where each key is a compound (with its
own root node) and the value is a set-element node. Both the
list-element and the set-element nodes again have their respective own
list of nodes and set of key-value pairs associated with them. This
way, each node in a compound can be considered to form the root of a
subtree that again has all of the characteristics of a compound
(except that its root has a parent node). All the key compounds listed
in the directory of a particular node differ pairwise in at least one
of their nodes. Therefore, the combination of a node and a key
compound found in its directory uniquely defines one value node. The
distinction between list nodes and set nodes clarifies for each node
whether its relative position among its siblings matters (as it does
in a list) or not (like in a set).

node -- A node is a position in a compound with which a string, a tag
value, a list of further nodes and a set of key-value pairs is
associated. Each of these subnodes (both list and set elements) can
equally have further subnodes.

cursor -- A cursor is a reference to a particular node in a particular
compound. Such cursors are the primary object handles used in PlexTree
library APIs.

string -- This is the sequence of zero or more arbitrary 8-bit bytes
that is associated with each node in a compound. (The abstract
definition of compounds does not impose a maximum length on strings.
In typical implementations, string lengths will have some upper limit,
for example the available memory, 2^31 bytes or 2^60 bytes. Portable
applications avoid the use of strings substantially larger than one
gigabyte.)

tag -- A tag is an unsigned 4-bit integer in the range 0..15 that is
associated with each node in a compound. It can be used to
differentiate semantically between different types of strings and
thereby helps to avoid the use of escape sequences or the need for
restrictions on the byte sequences that can appear in a string (binary
transparency). Tags in the range 0..7 imply that the string is meant
to be interpreted as a UTF-8-encoded character sequence.

control string -- The string of a node with tag value 0.

text string -- The string of a node with tag value 1.

meta string -- The string of a node with tag value 2.

compound interface -- A program interface (API) offered by a "compound
provider" to a "compound user" through which the compound user can
query and modify a compound. Such an interface typically provides to
the user cursor objects and associated methods for changing between
nodes by moving "downwards" to list- and set-element nodes and
"upwards" to the next parent node, querying and modifying the tag and
string of the current node, as well as enumerating, adding and
removing list and directory entries.

invisible key -- A key compound that is not enumerated in the
directory of a node to which a cursor refers, but that nevertheless
has an associated value that can be accessed by the user through the
compound interface.

visible compound -- the set of nodes and keys that can be accessed by
a compound user by descending only into either list elements or
key-value pairs whose keys ("visible keys") are enumerated in the
directory of the respective parent node.

filter -- A filter is a software function that acts as both a compound
provider and as a compound user. It takes information found in the
"bottom" compound that it can access as a compound user and uses that
(and in some cases also other, external information) to generate a new
"top" compound that it offers to a user as a compound provider. Since
filters use the same type of compound interface on both sides
("upwards" as a compound provider and "downwards" as a compound user),
they can be stacked, such that a compound may undergo several filter
transforms before it is seen by the top-level user. Filters can be
divided into "substitution filters" and "augmentation filters", based
on how their transformation is triggered. They can also be divided
into "batch filters" and "on-the-fly filters", based on how they
transform the bottom compound into the corresponding top compound.


substitution filter (subfil) -- A filter that is activated by the
presence of a control string in a node and which acts to replace from
the user's point of view the entire subcompound from that node on
downwards (including the control string) with a different compound.
The activity of a substitution filter is triggered by the bottom
compound.

augmentation filter (augfil) -- A filter that is triggered by an
attempt to descend into a set-element subnode via an invisible key and
which acts to generate that otherwise non-existing key. The activity
of an augmentation filter is triggered by the compound user above it.

batch filter -- A filter that, when activated, immediately fetches all
the information it needs from the bottom compound in order to build a
complete in-memory representation of the generated top compound. Batch
filters are typically used if the top compound is very small or if it
is not feasible to handle accesses to a query to a small part of the
top compound by querying only a small part of the bottom compound.

on-the-fly filter -- A filter that defers and limits accesses to the
bottom compound to the minimum necessary for handling requested
accesses to the top compound. Such filters can be more efficient if
accesses to only a small part of the top compound can be resolved by
accessing only a small part of the bottom compound.

XML -- XML is another general-purpose hierarchical data model and a
plain-text representation form, both derived from a much simplified
version of the ISO 8879 Standard Generalized Markup Language (SGML).
One of several important motivating factors for creating the PlexTree
architecture was to find a cleaner and more versatile replacement for
XML.

CRF -- A compound representation form (CRF) is an byte-representation
of a visible finite compound. Compounds are abstract data models that
can be represented in numerous forms as byte sequences, memory data
structures, or even graphical diagrams. A CRF maps the abstract data
model of a compound into a concrete representation that is readable
and manipulatable for machines or humans. A number of standard CRFs
have been defined for interoperable exchange of compound data, others
are implementation specific.

CRF-S -- The sortable compound representation form (CRF-S) is an
unambiguous byte-encoding of a compound, whose lexicographical sorting
order (on byte strings) corresponds to a simple sorting order that can
be defined over the set of all finite compounds. For each possible
finite visible compound there exists exactly one byte sequence that is
its CRF-S encoding.

CRF-T -- The textual compound representation form (CRF-T) is a
human-readable plain-text representation of a visible finite compound.
It is optimized to be easy to edit with standard plain-text editors
(e.g., emacs, notepad) and is particularly intended for use by
developers, system administrators and experienced users of
PlexTree-based applications.
