Survey of general-purpose data-representation formats and markup languages

The definition of XML sparked significant interest in standardized storage and transfer of structured information. While XML currently enjoys significant interest and investment from industry, it is only one example of a generic data structuring language. XML itself, for example, emerged merely as a simple subset of the Standard Generalized Markup Language (SGML), a mid-1980s ISO standard.

A large number of alternative technologies for representing structured data have been proposed or are emerging. Many of these were proposed to address shortcomings of XML. Some are merely alternative syntaxes for XML's abstract data model ("infoset"). Others propose radically different data models, some of which are both simpler to handle and at the same time more expressive.

Some are designed for general use, while others have specific purposes. Each embodies different features and trade-offs, and, depending on the application, may present a better option than the currently very popular choice XML.

The aim of this brief survey is to collect a comprehensive list of existing markup languages and similar generic structured file formats, to review their differences and relative trade-offs, and to promote discussion about next generation standards. I will be adding information as I discover more technologies and proposals in this area. Any comments, suggestions for improvements or recommendations for other languages, formats, references, etc. are greatly appreciated. Please email me at Steven.Murdoch at cl.cam.ac.uk.

Annotated bookmark list

XML discussion websites

XML Alternatives: List of markup/data serialization languages, designed for purposes for which XML might be used.
C2 Wiki :: XmlSucks: Long discussion on problems with XML and alternatives, in particular S-Expressions are suggested as a preferable notation.
[xml-dev] A plea for Sanity: Email from Joe English about problems with XML Namespaces.
XML Sucks: A few links to websites discussing the disadvantages of XML.
Kuro5hin :: XSLT, Perl, Haskell, & a word on language design: Article on XSLT, comparisons with other languages, and some problems with XSLT.
XML Suck: News site covering discussion on XML and its alternatives, not as disapproving of XML as the title suggests.

Markup languages and human readable serialization formats

YAML Ain't Markup Language

YAML is not a generalised markup language, but is instead a human-readable serialisation format for the native data structures of scripting languages. Files may be serialised/deserialised in their entirety or may be processed as a stream. They are not designed for random access.

Collections A YAML file is made out of one or more documents, each of which contains exactly one collection. The collections supported are based on those provided to scripting languages like Perl and Python. These are "Sequence" — an ordered set of elements, and "Mapping" — an unordered association of unique keys to values. Elements of the collections (sequence members, mapping keys and values) may be other collections (so forming a generalised graph structure) or scalars.

Scalars Scalars are defined as a sequence of Unicode characters, however when parsed these may be given an explicit or implicit type from a type library. The type library may be a custom one linked to by the document or one of the standard YAML types, for which a shorthand is provided.

Alias Mechanism In order to serialise the data, the in-memory structure is flattened from a graph to a tree. The structure lost through this operation is restored by the alias mechanism. When a collection or a scalar is defined it may be assigned an alias name. This alias name may then be used as a member of a collection, indicating that the referred scalar or collection is a member.
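The flattening step can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular YAML library works: the function names (count_references, to_flow) and the "idNNN" naming scheme are made up for the example, mapping keys are assumed to be strings, and scalars are written without quoting.

```python
from collections import Counter


def count_references(node, counts=None):
    """Count how many times each list/dict object is reached, by identity."""
    if counts is None:
        counts = Counter()
    if isinstance(node, (dict, list)):
        counts[id(node)] += 1
        if counts[id(node)] == 1:  # descend only once, so cycles terminate
            children = node.values() if isinstance(node, dict) else node
            for child in children:
                count_references(child, counts)
    return counts


def to_flow(node, counts, anchors):
    """Render node in flow style, using &name / *name for shared nodes."""
    if not isinstance(node, (dict, list)):
        return str(node)  # scalars are written as-is in this sketch
    key = id(node)
    if key in anchors:  # already written once: emit an alias instead
        return "*" + anchors[key]
    prefix = ""
    if counts[key] > 1:  # shared node: name it on first appearance
        anchors[key] = "id%03d" % (len(anchors) + 1)
        prefix = "&" + anchors[key] + " "
    if isinstance(node, dict):
        body = ", ".join(
            k + ": " + to_flow(v, counts, anchors) for k, v in node.items()
        )
        return prefix + "{" + body + "}"
    return prefix + "[" + ", ".join(to_flow(v, counts, anchors) for v in node) + "]"


# One address object referenced from two places forms a graph, not a tree.
address = {"given": "Chris", "family": "Dumars"}
invoice = {"bill-to": address, "ship-to": address}
text = to_flow(invoice, count_references(invoice), {})
print(text)
```

The shared address is written in full once and referred to by name the second time, so a reader of the output can rebuild the original graph.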

Syntax The YAML syntax is optimised for human readability rather than ease of parsing, so the syntax rules are complex. In practice, however, the parser hides this complexity from an application author. In XML it is unclear whether whitespace should be passed from the parser to the application. YAML, by contrast, clearly distinguishes whitespace that only indicates structure to a human reader, and is not part of the data, from significant whitespace that should be passed to the application program. So that documents may be word-wrapped for easy reading, single newlines in scalars are ignored, in a similar way to the LaTeX paragraph formatting rules.
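The newline rule can be illustrated with a small Python sketch. It is a simplification and not the full YAML folding algorithm: it assumes that single newlines become spaces and that a blank line marks a paragraph break, and the function name fold is invented for the example.

```python
import re


def fold(text):
    """Join word-wrapped lines; blank lines separate paragraphs."""
    paragraphs = re.split(r"\n[ \t]*\n", text.strip("\n"))
    return "\n".join(
        " ".join(line.strip() for line in paragraph.split("\n"))
        for paragraph in paragraphs
    )


wrapped = "Late afternoon is best.\nBackup contact is Nancy\nBillsmer @ 338-4338."
print(fold(wrapped))
```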

Whitespace is also used to define the structure, much as Python and Haskell use indentation to indicate blocks. This makes documents easy to read, but exceptions to the parsing rules are needed when significant whitespace needs to be included in a multi-line scalar.

While Unicode text may be directly included as UTF-8, UTF-16 (LE or BE) or UTF-32 (LE or BE), escape codes are defined to allow non-ASCII Unicode characters to be represented in an ASCII file.

Elements of a sequence are indicated by a leading "-"; mappings are indicated by the key, a ":", followed by a value. If the key is not a scalar then it must be preceded by a "? ". As mentioned before, scope is indicated by indentation. For example:

List of scalars

- List element 1
- List element 2

Mapping of scalars to scalars

Key 1: Value 1
Key 2: Value 2

Mapping of scalars to sequences

Key A:
    - Element A1
    - Element A2
Key B:
    - Element B1

The representation of sequences and mappings also has a compact syntax which is identical from the data model perspective but may aid readability in some circumstances. Sequences are represented as comma-separated lists in square brackets; mappings are represented as key ":" value pairs, separated by commas and enclosed by curly brackets. For example, the above mapping of scalars to sequences may be represented as:

Key A: [Element A1, Element A2]
Key B: [Element B1]

or as:

{Key A: [Element A1, Element A2], Key B: [Element B1]}

A parser must accept both syntaxes, but a program producing YAML may use either. It is expected that libraries will output YAML in a way judged best for human readability.
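To make the equivalence of the two notations concrete, the following Python sketch writes the same in-memory mapping in both styles. It is a toy emitter for this one shape of data (a mapping of scalars to sequences of scalars) and does no quoting or escaping; the function names are hypothetical.

```python
def to_block(mapping, indent=4):
    """Indented style: one key per line, one '- item' line per element."""
    lines = []
    for key, items in mapping.items():
        lines.append(key + ":")
        for item in items:
            lines.append(" " * indent + "- " + item)
    return "\n".join(lines)


def to_compact(mapping):
    """Compact style: square brackets for sequences, curly brackets for mappings."""
    pairs = (key + ": [" + ", ".join(items) + "]" for key, items in mapping.items())
    return "{" + ", ".join(pairs) + "}"


data = {"Key A": ["Element A1", "Element A2"], "Key B": ["Element B1"]}
print(to_block(data))
print(to_compact(data))
```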

Documents are separated by "---".

Other features The syntax permits comments to be embedded in documents.

Structure is shown through indentation (one or more spaces). Sequence items are denoted by a dash, and key value pairs within a map are separated by a colon.

Example from YAML website

--- !clarkevans.com/^invoice
invoice: 34843
date   : 2001-01-23
bill-to: &id001

    given  : Chris
    family : Dumars
    address:
        lines: |
            458 Walkman Dr.
            Suite #292
        city    : Royal Oak
        state   : MI
        postal  : 48046
ship-to: *id001
product:
    - sku         : BL394D
      quantity    : 4
      description : Basketball
      price       : 450.00
    - sku         : BL4438H
      quantity    : 1
      description : Super Hoop
      price       : 2392.00
tax  : 251.42
total: 4443.52
comments: >
    Late afternoon is best.
    Backup contact is Nancy
    Billsmer @ 338-4338.

LMNL (Layered Markup and Annotation Language)

LMNL (pronounced "liminal") is a general purpose text encoding language with a similar intended application domain to XML, but with a different data model. The key differences are that elements (known as ranges in LMNL) may overlap and that attributes (known as annotations) may have structure. Whereas XML is defined first as a syntax which then implies multiple data models, LMNL is defined by a data model and one example syntax is given.

Layers A LMNL document is made out of a tree of layers. A document has exactly one base text layer which is a list of Unicode characters. The document also has zero or more range layers. Each range layer has exactly one base layer and its content is a sequence of ranges. A layer L may be the base to zero or more layers. These layers are the overlays of layer L. The overlays of a layer form a set and do not have any ordering.

Ranges Each range belonging to layer L spans over the contents of the base of layer L. If the base is a text layer then the ranges span over characters; if the base is a range layer then the ranges span over ranges. Ranges may have a name, or they may be anonymous. A range belongs to exactly one layer. The range is defined by a start index and a length. Ranges may be of length 0, in which case they are points. A range has a list of zero or more annotations. The ranges belonging to a layer are assigned an order in the list of ranges, but in addition there is an implied order based on the start index.
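As a rough illustration of this data model, the sketch below represents ranges over a text layer as start/length pairs and tests whether two ranges overlap without nesting, which is the case XML elements cannot express. The class and function names are assumptions made for the example and are not taken from the LMNL specification.

```python
from dataclasses import dataclass, field


@dataclass
class Range:
    name: str  # empty string for an anonymous range
    start: int  # index into the base layer
    length: int  # 0 means the range is a point
    annotations: list = field(default_factory=list)

    @property
    def end(self):
        return self.start + self.length


def contains(a, b):
    return a.start <= b.start and b.end <= a.end


def overlaps(a, b):
    """True when a and b share content but neither contains the other."""
    intersect = a.start < b.end and b.start < a.end
    return intersect and not contains(a, b) and not contains(b, a)


text = "The quick brown fox"
s = Range("s", 0, 9)  # spans "The quick"
v = Range("v", 4, 11)  # spans "quick brown"
point = Range("mark", 9, 0)  # zero-length range

print(text[s.start:s.end], "|", text[v.start:v.end], "|", overlaps(s, v))

# The implied order is by start index, whatever the list order is.
implied = sorted([point, v, s], key=lambda r: r.start)
```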

Annotations Annotations are similar to XML attributes, with the notable difference that they may have structure. An annotation belongs to exactly one range or exactly one annotation. Each annotation must have a name, a sequence of annotations and a value. The value is a text layer and so can have layers and ranges as above.

Syntax Layers are declared using the processing instruction "[!layer name="layer_name" base="base_name"]", providing the layer name and the name of the base layer. Two layers are implicitly defined "#base" — representing the base text layer and "#default" representing the ranges that are not explicitly assigned to a layer.

Ranges are started by the "[range_name~layer_name}" tag and terminated by the "{range_name~layer_name]" tag. The range name may be empty, in which case it is anonymous, and the layer_name may be empty (and the "~" omitted), in which case the range belongs to the "#default" layer. Ranges may overlap, so in cases where it is ambiguous which range is to be closed, both the start and end tag can be assigned an ID to remove ambiguity. For example "[range_name~layer_name=key_id}".
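The tag forms just described can be generated mechanically. The following Python sketch builds start and end tags from a range name, an optional layer name and an optional ID, following only the description in this survey. The helper names are hypothetical and no escaping of special characters is attempted.

```python
def label(name="", layer="", key_id=""):
    """Build the 'name~layer=id' part shared by start and end tags."""
    text = name
    if layer:  # the "~" is omitted for the default layer
        text += "~" + layer
    if key_id:  # optional ID to disambiguate which range is closed
        text += "=" + key_id
    return text


def start_tag(name="", layer="", key_id=""):
    return "[" + label(name, layer, key_id) + "}"


def end_tag(name="", layer="", key_id=""):
    return "{" + label(name, layer, key_id) + "]"


# Two overlapping ranges: "s" closes before "v" does, although "v" opened later.
doc = (
    start_tag("s") + "The "
    + start_tag("v", "verse") + "quick" + end_tag("s")
    + " brown" + end_tag("v", "verse") + " fox"
)
print(doc)
```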

Annotations use the same start and end tags as ranges and are inserted after the name of the range in the start tag, but before the "}". Alternatively, annotations may be inserted in the end tag, after the name but before the "]". Annotations may also be added to other annotations in a similar manner. Since annotations cannot overlap, the end tag for an annotation may be abbreviated to "{]".

Other features LMNL also permits namespaces and entities in a similar but simplified way to XML and allows comments to be inserted in documents. There may be multiple documents in the above syntax that could produce a given document model, so any LMNL document can be represented as a "reified LMNL" document, in which ranges are defined over the document, ranges, annotations, text, and so on. The resulting data model preserves ordering information which would be lost in the original LMNL data model. Finally the specification also defines two subsets of the data model. In the "Flat" subset there is precisely one text layer and at most one range layer. In the "Tree" subset each layer must have at most one overlay, in one layer ranges must not overlap, no annotation value may have any overlays and no annotation may have any annotations. This subset may be represented as a node tree, and if the top level layer contains one range which spans over the entire document content then this may be represented as a well formed XML document.

C2 Wiki :: LayeredMarkupAnnotationLanguage

DL (A Streaming Data Language)

DL is a data (as opposed to document) representation format and so has no equivalent of XML attributes. It is designed to work well with streaming, so a document can have multiple entry points. Types are specified inside the file, rather than being inferred from an external schema. The type system is closely related to Java but is applicable to most modern language types.

Values Elements of a DL file are either values or structures. Values are atomic, and may be primitive (string, int, float, boolean), where the type is implied from the syntax, or constructed, where the type is explicitly stated (dates, URIs, etc.). The parameter to a constructed type is a value, which will be converted into the stated type.

Structures Structures are containers for values or other structures. There are three types — Maps, Vectors and Arrays. A Map is a mapping between a name and a value/structure, the value/structure may be absent. A vector is an ordered collection of values or structures. An array is a vector where each element is of the same type.

Syntax Primitive values are represented in the normal way for programming languages:

int	123
float	123.12
string	"val"
wsstring	""some value""
boolean	true | false

Constructed types are written as the type name followed, in brackets, by the value from which they are to be constructed. For example:
uri("http://xml-labrador.sourceforge.net/dl/")

A map is a name (a sequence of characters, not in quotes), optionally followed by a structure or value.

A vector is enclosed in curly braces. Elements are separated by spaces, except where a name (implying an empty map) is followed by a value. Here a semicolon must be inserted so that the name is not associated with the value. An array is a vector where all the elements are of the same type.
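A small Python sketch of a writer for these forms is given below. It follows only the description in this survey, not the DL specification itself: the function names are invented, string escaping is not handled, the "wsstring" form is left out, and the semicolon rule for bare names inside vectors is not implemented.

```python
def dl_value(value):
    """Render a primitive value; its type is implied by the syntax."""
    if isinstance(value, bool):  # test bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return repr(value)
    if isinstance(value, str):
        return '"' + value + '"'  # assumption: no quotes inside the string
    raise TypeError("not a primitive value: %r" % (value,))


def dl_constructed(type_name, value):
    """Render a constructed type: the type name, then the value in brackets."""
    return type_name + "(" + dl_value(value) + ")"


def dl_map(name, rendered=None):
    """Render a map: a bare name, optionally followed by a value or structure."""
    return name if rendered is None else name + " " + rendered


def dl_vector(rendered_items):
    """Render a vector: already rendered elements separated by spaces."""
    return "{" + " ".join(rendered_items) + "}"


print(dl_constructed("uri", "http://xml-labrador.sourceforge.net/dl/"))
print(dl_map("ports", dl_vector([dl_value(80), dl_value(8080)])))
```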

ONX (Open Node Syntax)

ONX is a more compact and easier to parse replacement for XML. It is intended for general use, but was designed with a view to being used in RPC.

Data Model An ONX stream is constructed from one or more anonymous information blocks. Each information block contains a sequence of nodes, each of which is either a Value Node or a Container Node. A value node has a name (which when converted to lowercase may not begin with "onx") and a sequence of zero or more strings. Strings are a sequence of arbitrary bytes. A container node contains a sequence of zero or more nodes, both value nodes and further container nodes.

Syntax An information block is started with ":onx{" and terminated with "}". Value nodes are started with ":", the node name, followed by "{". They are terminated by "}". The contents of a value node are a sequence of strings, each enclosed in quotes. Inside a string, backslashes and quotation marks are escaped with a backslash. Null bytes are written as "\0" and other non-ASCII characters are written using their hexadecimal representation as "\xHH". Arbitrary bytes can be included in the file by preceding them with "\[HHHHHHHH]", where HHHHHHHH is the 32 bit hexadecimal representation of the length of the block in bytes. For the specified number of bytes, null bytes, quotation marks and backslashes will be treated literally. The length of such a block is limited to 2³²–1 bytes, but more than one such block may be used in each string. Container nodes are started with ":", the node name, followed by "{". They are terminated by "}". Elements in a container node are separated with whitespace.
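The string escaping rules can be sketched as follows. This is an interpretation of the description above rather than of the ONX specification: the function name is hypothetical, upper-case hexadecimal digits are an assumption, and so is the choice to write ASCII control bytes as "\xHH" like non-ASCII bytes. The length-prefixed "\[HHHHHHHH]" form is not covered.

```python
def onx_string(data: bytes) -> str:
    """Quote a byte string using backslash escapes as described above."""
    out = []
    for byte in data:
        if byte == 0x5C:  # backslash
            out.append("\\\\")
        elif byte == 0x22:  # quotation mark
            out.append('\\"')
        elif byte == 0x00:  # null byte
            out.append("\\0")
        elif 0x20 <= byte < 0x7F:  # printable ASCII passes through
            out.append(chr(byte))
        else:  # everything else as \xHH (assumed upper-case)
            out.append("\\x%02X" % byte)
    return '"' + "".join(out) + '"'


print(onx_string(b'say "hi"\\\x00\xff'))
```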

Whitespace outside of strings is ignored by the parser. For debugging reasons whitespace can be used to aid human readability, but in production use it will normally be omitted. Also for debugging purposes, the end of a node may be explicitly stated, i.e. "}onx" for an information block, "}name" for a container node and "]name" for a value node. This increases the size of the stream but allows the parser to detect invalid nesting and may make it easier for a human to understand the file.

Property Lists

Plists are designed for the serialization of small amounts (less than a few hundred kilobytes) of data. They were part of the NeXTSTEP operating system and are now used on Mac OS X.

Data Model A Plist contains exactly one object, which may either be a container or a value. Containers contain further objects. The value types are CFString, representing a Unicode string, CFNumber, representing a numeric value, CFDate, representing a date, CFBoolean, representing a boolean, and CFData, representing a sequence of bytes. The container objects are CFArray, representing a sequence of objects, and CFDictionary, representing a mapping between keys and objects. Since there is no pointer type, only trees may be created, not graphs.

Syntax Three syntaxes are available for Plists, all of which are transparent to the programmer since they are supported by libraries built into the operating system. The original syntax used by NeXT was ASCII based. While it supports archiving all types, CFDate and CFNumber values will be unarchived as CFStrings. Strings are enclosed in double quotes and are encoded in Unicode. Binary data is encoded in hexadecimal ASCII and is enclosed in angle brackets. Arrays are enclosed in parentheses and elements are delimited by commas. Dictionaries are enclosed in curly braces and name=value pairs are separated by semicolons. For single-word alphanumeric values, quotes may be omitted. Whitespace outside of strings and inside binary data objects is ignored.
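A minimal writer for the ASCII syntax might look like the Python sketch below. It is an illustration based on the rules above, with some assumptions labelled in the comments: the function name is made up, quotes and backslashes inside strings are escaped with a backslash, and every name=value pair is followed by a semicolon. Dates and numbers are left out, since this syntax would hand them back as strings anyway.

```python
import re


def to_ascii_plist(obj):
    """Render str, bytes, list and dict objects in the ASCII plist syntax."""
    if isinstance(obj, str):
        if re.fullmatch(r"[A-Za-z0-9]+", obj):  # single alphanumeric word
            return obj
        # Assumption: backslash escaping inside quoted strings.
        return '"' + obj.replace("\\", "\\\\").replace('"', '\\"') + '"'
    if isinstance(obj, (bytes, bytearray)):
        return "<" + bytes(obj).hex() + ">"  # hexadecimal in angle brackets
    if isinstance(obj, (list, tuple)):
        return "(" + ", ".join(to_ascii_plist(item) for item in obj) + ")"
    if isinstance(obj, dict):
        pairs = "".join(
            to_ascii_plist(k) + " = " + to_ascii_plist(v) + "; "
            for k, v in obj.items()
        )
        return "{ " + pairs + "}" if pairs else "{}"
    raise TypeError("unsupported type: %r" % type(obj))


sample = {"name": "Chris Dumars", "tags": ["a", "b"], "raw": b"\x01\xff"}
print(to_ascii_plist(sample))
```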

The new XML syntax has a root tag of <plist> containing exactly one object. String values are enclosed in a <string> tag, an integer in an <integer> tag, a floating point number in a <real> tag and a date in a <date> tag. Data is base-64 encoded in a <data> tag and booleans are either <true /> or <false />. An array is enclosed by an <array> tag and encloses further object tags. Dictionaries are enclosed by a <dict> tag, containing zero or more key-value pairs, each consisting of a <key> tag enclosing the key, followed by an object tag.

A binary representation also exists and this may be used via the operating system libraries.
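The XML and binary syntaxes can be tried outside Mac OS X with the plistlib module from the Python standard library, which reads and writes both. The sample data below is invented for the illustration. The comments show which XML tag each Python type is expected to map to.

```python
import datetime
import plistlib

invoice = {
    "invoice": 34843,  # <integer>
    "total": 4443.52,  # <real>
    "paid": False,  # <false/>
    "date": datetime.datetime(2001, 1, 23),  # <date>
    "customer": "Chris Dumars",  # <string>
    "signature": b"\x00\x01\x02",  # <data>, base-64 encoded
    "items": ["Basketball", "Super Hoop"],  # <array>
}

xml_bytes = plistlib.dumps(invoice, fmt=plistlib.FMT_XML)
binary_bytes = plistlib.dumps(invoice, fmt=plistlib.FMT_BINARY)

# Both syntaxes carry the same data model, so both round-trip.
from_xml = plistlib.loads(xml_bytes)
from_binary = plistlib.loads(binary_bytes)

print(xml_bytes.decode("utf-8"))
```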

GODDAG (General Ordered-Descendant Directed Acyclic Graph)

GODDAG is a data model designed for the representation of (possibly overlapping) ranges over text. The structure is defined using graph-theoretic terms, but essentially it is similar to an SGML/XML parse tree, except that elements may have multiple parent nodes.

No syntax is defined for the model, however GODDAG is used as the basis of TexMECS.
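The idea of a parse tree whose nodes may have several parents can be sketched in Python as below. This is a loose illustration only, not the formal GODDAG definition: the Node class and the leaves helper are assumptions made for the example, and leaf nodes simply carry their text as their label.

```python
class Node:
    def __init__(self, label, children=()):
        self.label = label  # element name, or the text itself for a leaf
        self.children = list(children)  # ordered descendants
        self.parents = []  # unlike a tree, there may be several
        for child in self.children:
            child.parents.append(self)


def leaves(node, seen=None):
    """Return the leaves below node in order, visiting shared leaves once."""
    if seen is None:
        seen = set()
    if not node.children:
        if id(node) in seen:
            return []
        seen.add(id(node))
        return [node]
    found = []
    for child in node.children:
        found.extend(leaves(child, seen))
    return found


the, quick, brown, fox = (Node(t) for t in ["The ", "quick", " brown", " fox"])
s = Node("s", [the, quick])  # covers "The quick"
v = Node("v", [quick, brown])  # covers "quick brown", overlapping s
root = Node("doc", [s, v, fox])

print([p.label for p in quick.parents])
print("".join(leaf.label for leaf in leaves(root)))
```

The leaf "quick" belongs to both overlapping elements, which a single-parent tree cannot express.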

A specification language

A specification language is based on standard mathematical concepts of sets, lists, relations, and functions. It has a clearly separated data-model and has an ASCII based representation. Acceptable values for a document are defined by a type declaration system. These types may be included in the document to provide validity checking or may be automatically inferred.

Scalar Types Documents are made out of collections, which contain further collections or scalars. The scalars defined are "Void", which contains only one value, void, and "Bool", which contains true and false. "Num" is the set of whole numbers, "NatNum" is the set of positive whole numbers including zero, "PosNum" is the set of positive whole numbers excluding zero, "RatNum" is the set of rational numbers and "RealNum" is the set of real numbers.

The parameterized type "NumRange(f,t)" is the set of all real numbers from f to t, including both f and t. "Option(t)" includes all the values of type t and void. "Enum(id₁,...,id_n)" is the type containing the listed ids.

Records The record collection type is defined as "Record(id₁:t₁,...,id_n:t_n)", meaning that in each record value, the value of type t_x can be accessed through the name id_x.
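Since each type is essentially a set of acceptable values, the types described so far can be modelled as membership tests. The Python sketch below does this for a few of them. It illustrates the idea and is not the language's own notation: void is mapped to Python's None, records to dictionaries, and the example record type Item is invented.

```python
def is_whole(v):
    return isinstance(v, int) and not isinstance(v, bool)


def Num(v):
    return is_whole(v)


def NatNum(v):
    return is_whole(v) and v >= 0


def PosNum(v):
    return is_whole(v) and v > 0


def NumRange(f, t):
    """All numbers from f to t, including both f and t."""
    def check(v):
        return isinstance(v, (int, float)) and not isinstance(v, bool) and f <= v <= t
    return check


def Option(t):
    """All values of type t, plus void (None in this sketch)."""
    return lambda v: v is None or t(v)


def Enum(*ids):
    return lambda v: v in ids


def Record(**fields):
    """Each named field must be present and satisfy its type."""
    def check(v):
        return (
            isinstance(v, dict)
            and v.keys() == fields.keys()
            and all(t(v[name]) for name, t in fields.items())
        )
    return check


Item = Record(
    sku=Enum("BL394D", "BL4438H"),
    quantity=PosNum,
    discount=Option(NumRange(0, 1)),
)
print(Item({"sku": "BL394D", "quantity": 4, "discount": None}))
```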

Sets and lists A set is a collection of zero or more elements. Each element in the collection is of the same type, specified on declaration. Duplicates are not permitted. In a set defined by "Set(t)" the order of elements is not preserved, in a set defined by "OrderedSet(t)" the order of elements is preserved. In a list defined by "List(t)", order is preserved and duplicates are permitted.

Functions and Maps Functions are defined by "Function(t_d,t_r)" where t_d is the type of the domain and t_r is the type of the range. As in the normal mathematical sense, the function must provide a mapping from every element of the domain. A map is defined similarly as "Map(t_d,t_r)", however there is no restriction that all members of the domain are represented (i.e. it is a partial function). There is also a corresponding "OrderedFunc(t_d,t_r)" and "OrderedMap(t_d,t_r)" where the elements of the domain are ordered. For Function and OrderedFunc the type provides a mapping from every element of the domain to exactly one element of the range. For Map and OrderedMap the type provides a mapping from each element of a subset of the domain to exactly one element of the range.
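The difference between the total "Function" and the partial "Map" can be shown with a short Python check. The sketch assumes a finite domain given as a set and a range given as a membership test; the helper names and the sample data are hypothetical.

```python
def is_function(pairs, domain, in_range):
    """Total: every element of the domain is mapped to a value in the range."""
    return set(pairs) == set(domain) and all(in_range(v) for v in pairs.values())


def is_map(pairs, domain, in_range):
    """Partial: only a subset of the domain needs to be mapped."""
    return set(pairs) <= set(domain) and all(in_range(v) for v in pairs.values())


days = {"mon", "tue", "wed"}


def hour(v):
    return isinstance(v, int) and 0 <= v <= 23


opening = {"mon": 9, "tue": 9}  # "wed" is missing
print(is_map(opening, days, hour), is_function(opening, days, hour))
```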

Syntactic shorthand A relation is defined by "Relation(id₁:t₁,...,id_n:t_n)", and is a set of records, where in each record the value of type t_x can be accessed through the name id_x. Also, Functions and Maps can be defined as mapping records to records using the "Function(id_d1:t_d1,...,id_dn:t_dn -> id_r1:t_r1,...,id_rn:t_rn)" syntax.

Sets as types Types are essentially the set of valid values, therefore it is logical that values that are sets may be used as types themselves. This permits an assertion that one scalar value of a record is a member of a set which is also a member of the record. This may be done simply by stating the name of the set as the type of the scalar. If the set in question is at a higher nested level than the scalar to be defined then the name of the set is prepended by as many "^" characters as necessary to break out of the nested levels.

Alternative XML syntaxes

tXML (Terse XML)

tXML is a more concise alternative syntax to XML designed to be easier to read and write. Situations where tXML would be useful include configuration files, where mixed content is rare.

Syntax Tags are opened with "{" and closed with "}". Before the "{", the name of the tag is specified, followed by whitespace and name=value pairs for attributes, each separated by whitespace. Attribute values only have to be enclosed in quotes if they include spaces or other special characters. Empty tags may be terminated with a semicolon. To differentiate tags from text, character data is enclosed in quotes and followed by a semicolon. Tags which contain only character data may omit the curly braces.

CDATA is started with "<[" and closed with "]>". Entity references are expanded within strings, as with XML. Processing instructions are normal tags prepended with "$PI", but shortcuts exist for the XML prologue and DOCTYPE declarations. Comments are as in Java: single-line comments are started with "//" and multiline comments are started with "/*" and ended with "*/".
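As a rough illustration of how a configuration-style XML document might look in this notation, the Python sketch below converts an element tree using only the rules described above. It is an interpretation, not a reference converter: the function names are hypothetical, the set of characters allowed in unquoted attribute values is a guess, and CDATA, entities, processing instructions, comments and text following child elements are ignored.

```python
import re
import xml.etree.ElementTree as ET


def attr_text(value):
    """Quote an attribute value only if it is not a simple token (assumed set)."""
    if re.fullmatch(r"[A-Za-z0-9_.-]+", value):
        return value
    return '"' + value + '"'


def to_txml(elem, depth=0):
    pad = "  " * depth
    head = " ".join(
        [elem.tag] + [k + "=" + attr_text(v) for k, v in elem.attrib.items()]
    )
    text = (elem.text or "").strip()
    children = list(elem)
    if not children and not text:  # empty tag: end with a semicolon
        return pad + head + ";"
    if not children:  # only character data: braces omitted
        return pad + head + ' "' + text + '";'
    lines = [pad + head + " {"]
    if text:
        lines.append(pad + '  "' + text + '";')
    lines.extend(to_txml(child, depth + 1) for child in children)
    lines.append(pad + "}")
    return "\n".join(lines)


source = '<server port="80" name="my host"><root>/var/www</root><debug/></server>'
print(to_txml(ET.fromstring(source)))
```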

BM (BetterMarkup)

BM (Better Markup) is a variation on the XML syntax. It is designed to be more compact and easier for a human to write.

Changes

Rather than using SGML entities, BM uses backslash to escape characters, in a similar way to C. Since & is no longer used for entities it does not need to be written as an entity reference.
Tags may be closed by "~", in a similar way to </> in SGML. Also all open tags are implicitly closed at the end of a file.
Processing instructions are started with "{*" followed immediately by the name of the instruction.
Multiline comments may be started with "{* " (the whitespace is necessary to prevent it being confused with a processing instruction).
Attributes do not need to be enclosed in quotes.
CDATA elements are simply enclosed in curly braces.
A default attribute can be set for each tag.

SXML

SXML is a concrete representation of the XML Infoset in the form of S-expressions.
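To give a feel for the notation, the Python sketch below prints an XML element as an S-expression, with attributes collected in a list headed by "@". It is a simplified illustration that covers elements, attributes and text only. Namespaces, comments, processing instructions and escaping of quotes inside strings are not handled, and the function name is invented.

```python
import xml.etree.ElementTree as ET


def to_sxml(elem):
    parts = [elem.tag]
    if elem.attrib:
        attrs = " ".join('(%s "%s")' % (k, v) for k, v in elem.attrib.items())
        parts.append("(@ " + attrs + ")")
    if elem.text and elem.text.strip():  # skip whitespace-only text
        parts.append('"' + elem.text + '"')
    for child in elem:
        parts.append(to_sxml(child))
        if child.tail and child.tail.strip():  # text after the child
            parts.append('"' + child.tail + '"')
    return "(" + " ".join(parts) + ")"


doc = ET.fromstring('<p class="x">Hello <b>world</b>!</p>')
print(to_sxml(doc))
```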

Steven J. Murdoch

http://www.cl.cam.ac.uk/users/sjm217/projects/markup/survey/ – $Date: 2005-07-28 14:02:55 +0100 (Thu, 28 Jul 2005) $ .