Module Html5rw.Dom

DOM types and manipulation functions.

This module provides the core types for representing HTML documents as DOM trees, together with functions to build, inspect, modify, and serialize them.

HTML5 DOM Types and Operations

This module provides the DOM (Document Object Model) node representation used by the HTML5 parser. The DOM is a programming interface that represents an HTML document as a tree of nodes, where each node represents part of the document (an element, text content, comment, etc.).

What is the DOM?

When an HTML parser processes markup like <p>Hello <b>world</b></p>, it doesn't store the text directly. Instead, it builds a tree structure in memory:

Document
└── html
    └── body
        └── p
            ├── #text "Hello "
            └── b
                └── #text "world"

This tree is the DOM. Each box in the tree is a node. Programs can traverse and modify this tree to read or change the document.
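The tree above can be modelled in a few lines of OCaml. The sketch below is standalone and only illustrative: the `tree` type and the `outline` function are hypothetical names, deliberately simpler than the `node` record this module defines.

```ocaml
(* Standalone sketch: a cut-down tree type, not the library's node record. *)
type tree =
  | Text of string
  | Element of string * tree list

(* The subtree for <p>Hello <b>world</b></p>. *)
let p = Element ("p", [ Text "Hello "; Element ("b", [ Text "world" ]) ])

(* Render one line per node, indenting two spaces per level of depth. *)
let rec outline depth t =
  let pad = String.make (2 * depth) ' ' in
  match t with
  | Text s -> [ Printf.sprintf "%s#text %S" pad s ]
  | Element (name, kids) ->
    (pad ^ name) :: List.concat_map (outline (depth + 1)) kids

let () = List.iter print_endline (outline 0 p)
(* Prints:
   p
     #text "Hello "
     b
       #text "world"
*)
```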

Node Types

The HTML5 DOM includes several node types (element, text, comment, document, document fragment, and doctype), all represented by the same record type with different field usage.

Namespaces

HTML5 can embed content from other XML vocabularies. Elements belong to one of three namespaces: HTML (None), SVG (Some "svg"), or MathML (Some "mathml").

The parser automatically switches namespaces when entering and leaving these foreign content islands.

Tree Structure

Nodes form a bidirectional tree: each node has a list of children and an optional parent reference. Modification functions in this module maintain these references automatically.

The tree is always well-formed: a node can only have one parent, and circular references are not possible.
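The bookkeeping described above can be sketched as follows. This is a standalone illustration under stated assumptions: the record is a cut-down stand-in for `node`, and `make`, `detach`, `append`, and `consistent` are hypothetical helpers, not the library's implementation.

```ocaml
(* Standalone sketch: a cut-down record, not Html5rw.Dom.node itself. *)
type node = {
  name : string;
  mutable children : node list;
  mutable parent : node option;
}

let make name = { name; children = []; parent = None }

(* Detach [child] from its current parent, if any. Physical equality is
   used because structural equality could loop on the parent cycle. *)
let detach child =
  match child.parent with
  | None -> ()
  | Some old ->
    old.children <- List.filter (fun c -> c != child) old.children;
    child.parent <- None

(* Append [child] under [parent], keeping both directions in sync. *)
let append parent child =
  detach child;
  parent.children <- parent.children @ [ child ];
  child.parent <- Some parent

(* Invariant: every child points back at the node that lists it. *)
let rec consistent n =
  List.for_all
    (fun c ->
      (match c.parent with Some p -> p == n | None -> false)
      && consistent c)
    n.children

let () =
  let a = make "div" and b = make "section" and x = make "p" in
  append a x;
  append b x; (* moves x, since a node can only have one parent *)
  assert (a.children = [] && consistent a && consistent b);
  assert (List.length b.children = 1)
```

Checking the invariant after each mutation is a cheap way to catch a helper that updates only one side of the link.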

Types

type doctype_data = {
  name : string option;       (* The DOCTYPE name, e.g., "html" *)
  public_id : string option;  (* Public identifier (legacy, rarely used) *)
  system_id : string option;  (* System identifier (legacy, rarely used) *)
}

Information associated with a DOCTYPE node.

The document type declaration (DOCTYPE) tells browsers what version of HTML the document uses. In HTML5, the standard declaration is simply:

<!DOCTYPE html>

This minimal DOCTYPE triggers standards mode (no quirks). The DOCTYPE can optionally include a public identifier and system identifier for legacy compatibility with SGML-based tools, but these are rarely used in modern HTML5 documents.

Historical context: In HTML4 and XHTML, DOCTYPEs were verbose and referenced DTD files. For example:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
 "http://www.w3.org/TR/html4/strict.dtd">

HTML5 simplified this to just <!DOCTYPE html> because:

  • Browsers never actually fetched or validated against DTDs
  • The DOCTYPE's only real purpose is triggering standards mode
  • A minimal DOCTYPE achieves this goal

Field meanings:

  • name: The document type name, almost always "html" for HTML documents
  • public_id: A public identifier (legacy); None for HTML5
  • system_id: A system identifier/URL (legacy); None for HTML5

type quirks_mode =
  | No_quirks
  | Quirks
  | Limited_quirks

Quirks mode setting for the document.

Quirks mode is a browser rendering mode that emulates bugs and non-standard behaviors from older browsers (primarily Internet Explorer 5). Modern HTML5 documents should always render in standards mode (no quirks) for consistent, predictable behavior.

The HTML5 parser determines quirks mode based on the DOCTYPE declaration:

  • No_quirks (Standards mode): The document renders according to modern HTML5 and CSS specifications. This is triggered by <!DOCTYPE html>. CSS box model, table layout, and other features work as specified.
  • Quirks (Full quirks mode): The document renders with legacy browser bugs emulated. This happens when:

    • DOCTYPE is missing entirely
    • DOCTYPE has certain legacy public identifiers
    • DOCTYPE has the wrong format

    In quirks mode, many CSS properties behave differently:

    • Tables don't inherit font properties
    • Box model uses non-standard width calculations
    • Certain CSS selectors don't work correctly

  • Limited_quirks (Almost standards mode): A middle ground that applies only a few specific quirks, primarily affecting table cell vertical sizing. Triggered by XHTML DOCTYPEs and certain HTML4 DOCTYPEs.

Recommendation: Always use <!DOCTYPE html> at the start of HTML5 documents to ensure No_quirks mode.
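To make the decision concrete, here is a deliberately simplified classifier. It is a sketch, not the parser's code: `classify` and `starts_with` are hypothetical names, the types are re-declared locally so the block stands alone, and only a handful of identifiers are checked, whereas the full HTML5 rules list many more legacy public identifiers. It assumes OCaml 4.13 or later for `String.starts_with`.

```ocaml
type doctype_data = {
  name : string option;
  public_id : string option;
  system_id : string option;
}

type quirks_mode = No_quirks | Quirks | Limited_quirks

(* Case-insensitive prefix test on an optional identifier. *)
let starts_with prefix = function
  | None -> false
  | Some id ->
    String.starts_with
      ~prefix:(String.lowercase_ascii prefix)
      (String.lowercase_ascii id)

(* Simplified subset of the rules; most legacy identifiers are omitted. *)
let classify (dt : doctype_data option) : quirks_mode =
  match dt with
  | None -> Quirks (* DOCTYPE missing entirely *)
  | Some d ->
    let html4_loose =
      starts_with "-//W3C//DTD HTML 4.01 Transitional//" d.public_id
      || starts_with "-//W3C//DTD HTML 4.01 Frameset//" d.public_id
    and xhtml_loose =
      starts_with "-//W3C//DTD XHTML 1.0 Transitional//" d.public_id
      || starts_with "-//W3C//DTD XHTML 1.0 Frameset//" d.public_id
    in
    if Option.map String.lowercase_ascii d.name <> Some "html" then Quirks
    else if html4_loose && d.system_id = None then Quirks
    else if html4_loose || xhtml_loose then Limited_quirks
    else No_quirks
```

With this sketch, `classify None` yields `Quirks`, and a doctype whose name is "html" with no identifiers yields `No_quirks`.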

type node = {
  mutable name : string;
    (* Tag name for elements, or special name for other node types.

       For elements, this is the lowercase tag name (e.g., "div", "span"). For other node types, use the constants document_name, text_name, comment_name, etc. *)
  mutable namespace : string option;
    (* Element namespace: None for HTML, Some "svg", Some "mathml".

       Most elements are in the HTML namespace (None). The SVG and MathML namespaces are only used when content appears inside <svg> or <math> elements respectively. *)
  mutable attrs : (string * string) list;
    (* Element attributes as (name, value) pairs.

       Attributes provide additional information about elements. Common global attributes include:

       • id: Unique identifier for the element
       • class: Space-separated list of CSS class names
       • style: Inline CSS styles
       • title: Advisory text (shown as tooltip)
       • lang: Language of the element's content
       • hidden: Whether the element should be hidden

       Element-specific attributes include:

       • href on <a>: The link destination URL
       • src on <img>: The image source URL
       • type on <input>: The input control type
       • disabled on form controls: Whether the control is disabled

       In HTML5, attribute names are case-insensitive and are normalized to lowercase by the parser. *)
  mutable children : node list;
    (* Child nodes in document order.

       For most elements, this list contains the nested elements and text. For void elements (like <br>, <img>), this is always empty. For <template> elements, the actual content is in template_content, not here. *)
  mutable parent : node option;
    (* Parent node, None for root nodes.

       Nodes attached to a tree have a parent; root nodes (the Document node, a detached fragment, or a freshly created node) do not. This back-reference enables traversing up the tree. *)
  mutable data : string;
    (* Text content for text and comment nodes.

       For text nodes, this contains the actual text. For comment nodes, this contains the comment text (without the <!-- and --> delimiters). For other node types, this field is empty. *)
  mutable template_content : node option;
    (* Document fragment for <template> element contents.

       The <template> element holds "inert" content that is not rendered but can be cloned and inserted elsewhere. This field contains a document fragment with the template's content.

       For non-template elements, this is None. *)
  mutable doctype : doctype_data option;
    (* DOCTYPE information for doctype nodes.

       Only doctype nodes use this field; for all other nodes it is None. *)
}

A DOM node in the parsed document tree.

All node types use the same record structure. The name field determines the node type:

  • Element: the tag name (e.g., "div", "p", "span")
  • Text: "#text"
  • Comment: "#comment"
  • Document: "#document"
  • Document fragment: "#document-fragment"
  • Doctype: "!doctype"

Understanding Node Fields

Different node types use different combinations of fields:

Node Type         | name                 | namespace | attrs | data | template_content | doctype
------------------|----------------------|-----------|-------|------|------------------|--------
Element           | tag name             | Yes       | Yes   | No   | If <template>    | No
Text              | "#text"              | No        | No    | Yes  | No               | No
Comment           | "#comment"           | No        | No    | Yes  | No               | No
Document          | "#document"          | No        | No    | No   | No               | No
Document Fragment | "#document-fragment" | No        | No    | No   | No               | No
Doctype           | "!doctype"           | No        | No    | No   | No               | Yes

Element Tag Names

For element nodes, the name field contains the lowercase tag name. HTML5 defines many elements with specific meanings:

Structural elements: html, head, body, header, footer, main, nav, article, section, aside

Text content: p, div, span, h1-h6, pre, blockquote

Lists: ul, ol, li, dl, dt, dd

Tables: table, tr, td, th, thead, tbody, tfoot

Forms: form, input, button, select, textarea, label

Media: img, audio, video, canvas, svg

Void Elements

Some elements are void elements - they cannot have children and have no end tag. These include: area, base, br, col, embed, hr, img, input, link, meta, source, track, wbr.
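A membership test over that list is enough to recognise void elements. The sketch below is illustrative: `void_elements` and `is_void` are hypothetical names and are not stated to be part of this module.

```ocaml
(* The void elements listed above. *)
let void_elements =
  [ "area"; "base"; "br"; "col"; "embed"; "hr"; "img"; "input";
    "link"; "meta"; "source"; "track"; "wbr" ]

(* Tag names are compared in lowercase, matching the parser's normalization. *)
let is_void name = List.mem (String.lowercase_ascii name) void_elements
```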

The Template Element

The <template> element is special: its children are not rendered directly but stored in a separate document fragment accessible via the template_content field. Templates are used for client-side templating where content is cloned and inserted via JavaScript.

Node Name Constants

These constants identify special node types. Compare with node.name to determine the node type.
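One way to turn a name into a node kind is a single match on the string. This is a sketch only: the `kind` type and `kind_of_name` are hypothetical, and the string literals repeat the documented constant values because OCaml patterns cannot refer to `val` bindings such as `text_name`.

```ocaml
type kind = Element | Text | Comment | Document | Document_fragment | Doctype

(* Any name that is not one of the special names is treated as a tag name. *)
let kind_of_name = function
  | "#text" -> Text
  | "#comment" -> Comment
  | "#document" -> Document
  | "#document-fragment" -> Document_fragment
  | "!doctype" -> Doctype
  | _ -> Element
```

In code that uses the library, comparing against the constants (for example `node.name = text_name`) or calling the predicates documented further down avoids hard-coding the strings.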

val document_name : string

"#document" - name for document nodes.

The Document node is the root of every HTML document tree. It represents the entire document and is the parent of the <html> element.

val document_fragment_name : string

"#document-fragment" - name for document fragment nodes.

Document fragments are lightweight container nodes used to hold a collection of nodes without a parent document. They are used:

  • To hold <template> element contents
  • As results of fragment parsing (innerHTML)
  • For efficient batch DOM operations

val text_name : string

"#text" - name for text nodes.

Text nodes contain the character data within elements. When the parser encounters text between tags like <p>Hello world</p>, it creates a text node with data "Hello world" as a child of the <p> element.

Adjacent text nodes are automatically merged by the parser.

val comment_name : string

"#comment" - name for comment nodes.

Comment nodes represent HTML comments: <!-- comment text -->. Comments are preserved in the DOM but not rendered to users. They're useful for development notes or conditional content.

val doctype_name : string

"!doctype" - name for doctype nodes.

The DOCTYPE node represents the <!DOCTYPE html> declaration. When present, it is a child of the Document node and precedes the <html> element; only comments can come before it.

Constructors

Functions to create new DOM nodes. All nodes start with no parent and no children. Use append_child or insert_before to build a tree.

val create_element : string -> ?namespace:string option -> ?attrs:(string * string) list -> unit -> node

Create an element node.

Elements are the primary building blocks of HTML documents. Each element represents a component of the document with semantic meaning.

  • parameter name

    The tag name (e.g., "div", "p", "span"). Tag names are case-insensitive in HTML; by convention, use lowercase.

  • parameter namespace

    Element namespace:

    • None (default): HTML namespace for standard elements
    • Some "svg": SVG namespace for graphics elements
    • Some "mathml": MathML namespace for mathematical notation
  • parameter attrs

    Initial attributes as (name, value) pairs

Examples:

  (* Simple HTML element *)
  let div = create_element "div" ()

  (* Element with attributes *)
  let link = create_element "a"
    ~attrs:[("href", "https://example.com"); ("class", "external")]
    ()

  (* SVG element *)
  let rect = create_element "rect"
    ~namespace:(Some "svg")
    ~attrs:[("width", "100"); ("height", "50"); ("fill", "blue")]
    ()

val create_text : string -> node

Create a text node with the given content.

Text nodes contain the readable content of HTML documents. They appear as children of elements and represent the characters that users see.

Note: Text content is stored as-is. Character references like &amp; should already be decoded to their character values.

Example:

  (* To put text in a paragraph: *)
  let text = create_text "Hello, world!" in
  let p = create_element "p" () in
  append_child p text

val create_comment : string -> node

Create a comment node with the given content.

Comments are human-readable notes in HTML that don't appear in the rendered output. They're written as <!-- comment --> in HTML.

  • parameter data

    The comment text (without the <!-- and --> delimiters)

Example:

  let comment = create_comment " TODO: Add navigation "
  (* Represents: <!-- TODO: Add navigation --> *)

val create_document : unit -> node

Create an empty document node.

The Document node is the root of an HTML document tree. It represents the entire document and serves as the parent for the DOCTYPE (if any) and the root <html> element.

In a complete HTML document, the structure is:

#document
├── !doctype
└── html
    ├── head
    └── body

val create_document_fragment : unit -> node

Create an empty document fragment.

Document fragments are lightweight containers that can hold multiple nodes without being part of the main document tree. They're useful for:

  • Template contents: The <template> element stores its children in a document fragment, keeping them inert until cloned
  • Fragment parsing: When parsing HTML fragments (like innerHTML), the result is placed in a document fragment
  • Batch operations: Build a subtree in a fragment, then insert it into the document in one operation for better performance

val create_doctype : ?name:string -> ?public_id:string -> ?system_id:string -> unit -> node

Create a DOCTYPE node.

The DOCTYPE declaration tells browsers to use standards mode for rendering. For HTML5 documents, use:

  let doctype = create_doctype ~name:"html" ()
  (* Represents: <!DOCTYPE html> *)

  • parameter name

    DOCTYPE name (usually "html" for HTML documents)

  • parameter public_id

    Public identifier (legacy, rarely needed)

  • parameter system_id

    System identifier (legacy, rarely needed)

Legacy example:

  (* HTML 4.01 Strict DOCTYPE - not recommended for new documents *)
  let legacy = create_doctype
    ~name:"HTML"
    ~public_id:"-//W3C//DTD HTML 4.01//EN"
    ~system_id:"http://www.w3.org/TR/html4/strict.dtd"
    ()

val create_template : ?namespace:string option -> ?attrs:(string * string) list -> unit -> node

Create a <template> element with its content document fragment.

The <template> element holds inert HTML content that is not rendered directly. The content is stored in a separate document fragment and can be:

  • Cloned and inserted into the document via JavaScript
  • Used as a stamping template for repeated content
  • Pre-parsed without affecting the page

How templates work:

Unlike normal elements, a <template>'s children are not rendered. Instead, they're stored in the template_content field. This means:

  • Images inside won't load
  • Scripts inside won't execute
  • The content is "inert" until explicitly activated

Example:

  let template = create_template () in
  let div = create_element "div" () in
  let text = create_text "Template content" in
  append_child div text;
  (* Add to template's content fragment, not children *)
  match template.template_content with
  | Some fragment -> append_child fragment div
  | None -> ()

Node Type Predicates

Functions to test what type of node you have. Since all nodes use the same record type, these predicates check the name field to determine the actual node type.

val is_element : node -> bool

is_element node returns true if the node is an element node.

Elements are HTML tags like <div>, <p>, <a>. They are identified by having a tag name that doesn't match any of the special node name constants.

val is_text : node -> bool

is_text node returns true if the node is a text node.

Text nodes contain the character content within elements. They have name = "#text".

val is_comment : node -> bool

is_comment node returns true if the node is a comment node.

Comment nodes represent HTML comments <!-- ... -->. They have name = "#comment".

val is_document : node -> bool

is_document node returns true if the node is a document node.

The document node is the root of the DOM tree. It has name = "#document".

val is_document_fragment : node -> bool

is_document_fragment node returns true if the node is a document fragment.

Document fragments are lightweight containers. They have name = "#document-fragment".

val is_doctype : node -> bool

is_doctype node returns true if the node is a DOCTYPE node.

DOCTYPE nodes represent the <!DOCTYPE> declaration. They have name = "!doctype".

val has_children : node -> bool

has_children node returns true if the node has any children.

Note: For <template> elements, this checks the direct children list, not the template content fragment.

Tree Manipulation

Functions to modify the DOM tree structure. These functions automatically maintain parent/child references, ensuring the tree remains consistent.

val append_child : node -> node -> unit

append_child parent child adds child as the last child of parent.

The child's parent reference is updated to point to parent. If the child already has a parent, it is first removed from that parent.

Example:

  let body = create_element "body" () in
  let p = create_element "p" () in
  let text = create_text "Hello!" in
  append_child p text;
  append_child body p
  (* Result:
     body
     └── p
         └── #text "Hello!"
  *)

val insert_before : node -> node -> node -> unit

insert_before parent new_child ref_child inserts new_child before ref_child in parent's children.

  • parameter parent

    The parent node

  • parameter new_child

    The node to insert

  • parameter ref_child

    The existing child to insert before

Raises Not_found if ref_child is not a child of parent.

Example:

  let ul = create_element "ul" () in
  let li1 = create_element "li" () in
  let li3 = create_element "li" () in
  append_child ul li1;
  append_child ul li3;
  let li2 = create_element "li" () in
  insert_before ul li2 li3
  (* Result: ul contains li1, li2, li3 in that order *)

val remove_child : node -> node -> unit

remove_child parent child removes child from parent's children.

The child's parent reference is set to None.

Raises Not_found if child is not a child of parent.
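The list-level behaviour of insert_before and remove_child, including the Not_found case, can be sketched on plain lists. The helpers below are hypothetical and work on any list; they assume children are identified by physical equality (==), which is an assumption about the library rather than something this page states.

```ocaml
(* Insert [x] immediately before the element physically equal to [ref_]. *)
let rec insert_before_in x ref_ = function
  | [] -> raise Not_found
  | y :: rest when y == ref_ -> x :: y :: rest
  | y :: rest -> y :: insert_before_in x ref_ rest

(* Remove the element physically equal to [x]. *)
let rec remove_from x = function
  | [] -> raise Not_found
  | y :: rest when y == x -> rest
  | y :: rest -> y :: remove_from x rest
```

A real implementation would also update the parent field of the inserted or removed node, as the descriptions above require.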

val insert_text_at : node -> string -> node option -> unit

insert_text_at parent text before_node inserts text content.

If before_node is None, appends at the end. If the previous sibling is a text node, the text is merged into it (text nodes are coalesced). Otherwise, a new text node is created.

This implements the HTML5 parser's text insertion algorithm which ensures adjacent text nodes are always merged, matching browser behavior.
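The coalescing rule for the append case (before_node = None) can be sketched as follows. This is a standalone illustration with a cut-down record; `text` and `append_text` are hypothetical names, and the before_node = Some case is left out.

```ocaml
(* Standalone sketch: a cut-down record, not Html5rw.Dom.node itself. *)
type node = {
  name : string;
  mutable data : string;
  mutable children : node list;
}

let text s = { name = "#text"; data = s; children = [] }

(* Append text at the end of [parent], merging into a trailing text node
   when there is one, so two text nodes never end up adjacent. *)
let append_text parent s =
  match List.rev parent.children with
  | last :: _ when last.name = "#text" -> last.data <- last.data ^ s
  | _ -> parent.children <- parent.children @ [ text s ]
```

Appending "Hello, " and then "world" to an empty element leaves a single text child whose data is "Hello, world".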

Attribute Operations

Functions to read and modify element attributes. Attributes are name-value pairs that provide additional information about elements.

In HTML5, attribute names are case-insensitive and normalized to lowercase by the parser.

val get_attr : node -> string -> string option

get_attr node name returns the value of attribute name, or None if the attribute doesn't exist.

Attribute lookup is case-sensitive on the stored (lowercase) names.

val set_attr : node -> string -> string -> unit

set_attr node name value sets attribute name to value.

If the attribute already exists, it is replaced. If it doesn't exist, it is added.

val has_attr : node -> string -> bool

has_attr node name returns true if the node has attribute name.
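Because attributes are stored as an association list, these operations map closely onto the standard List functions. The sketch below works on the list alone, not on a node; `lookup` and `update` are hypothetical names, and keeping a replaced attribute at its original position is an assumption, not documented behaviour.

```ocaml
type attrs = (string * string) list

(* Case-sensitive lookup on the stored names. *)
let lookup (attrs : attrs) name = List.assoc_opt name attrs

(* Replace in place when the attribute exists, otherwise append it. *)
let update (attrs : attrs) name value : attrs =
  if List.mem_assoc name attrs then
    List.map (fun (n, v) -> if n = name then (n, value) else (n, v)) attrs
  else attrs @ [ (name, value) ]
```

Since the parser stores names in lowercase, `lookup attrs "Class"` finds nothing even when a class attribute is present, so lowercase the name before looking it up.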

Tree Traversal

Functions to navigate the DOM tree.

val descendants : node -> node list

descendants node returns all descendant nodes in document order.

This performs a depth-first traversal, returning children before siblings at each level. The node itself is not included.

Document order is the order nodes appear in the HTML source: parent before children, earlier siblings before later ones.

Example:

  (* For tree: div > (p > "hello", span > "world") *)
  descendants div
  (* Returns: [p; text("hello"); span; text("world")] *)

val ancestors : node -> node list

ancestors node returns all ancestor nodes from parent to root.

The first element is the immediate parent, the last is the root (usually the Document node).

Example:

  (* For a text node inside: html > body > p > text *)
  ancestors text_node
  (* Returns: [p; body; html; #document] *)

val get_text_content : node -> string

get_text_content node returns the concatenated text content.

For text nodes, returns the text data directly. For elements, recursively concatenates all descendant text content. For other node types, returns an empty string.

Example:

  (* For: <p>Hello <b>world</b>!</p> *)
  get_text_content p_element
  (* Returns: "Hello world!" *)
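The three traversals above share one shape, sketched here on a cut-down record. This is a standalone illustration, not the library's code: `make`, `add`, and `text_content` are hypothetical helpers, and `text_content` is simplified in that it does not special-case document or doctype nodes.

```ocaml
(* Standalone sketch: a cut-down record, not Html5rw.Dom.node itself. *)
type node = {
  name : string;
  data : string;
  mutable children : node list;
  mutable parent : node option;
}

let make ?(data = "") name = { name; data; children = []; parent = None }

let add parent child =
  parent.children <- parent.children @ [ child ];
  child.parent <- Some parent

(* Pre-order depth-first walk: a child, then its subtree, then its siblings. *)
let rec descendants n =
  List.concat_map (fun c -> c :: descendants c) n.children

(* Follow parent links upward, nearest ancestor first. *)
let rec ancestors n =
  match n.parent with
  | None -> []
  | Some p -> p :: ancestors p

(* Concatenate the data of every text node in document order. *)
let text_content n =
  if n.name = "#text" then n.data
  else
    descendants n
    |> List.filter (fun d -> d.name = "#text")
    |> List.map (fun d -> d.data)
    |> String.concat ""
```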

Cloning

val clone : ?deep:bool -> node -> node

clone ?deep node creates a copy of the node.

  • parameter deep

    If true, recursively clone all descendants (default: false)

The cloned node has no parent. With deep:false, only the node itself is copied (with its attributes, but not its children).
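The difference between the two modes can be sketched on a cut-down record. This is a standalone illustration under stated assumptions: the record is not the library's `node`, and this `clone` is a local stand-in that only mirrors the documented behaviour.

```ocaml
(* Standalone sketch: a cut-down record, not Html5rw.Dom.node itself. *)
type node = {
  name : string;
  attrs : (string * string) list;
  mutable children : node list;
  mutable parent : node option;
}

let rec clone ?(deep = false) n =
  (* The copy keeps name and attributes but starts detached and childless. *)
  let copy = { n with children = []; parent = None } in
  if deep then
    copy.children <-
      List.map
        (fun c ->
          let c' = clone ~deep:true c in
          c'.parent <- Some copy; (* re-link each cloned child to the copy *)
          c')
        n.children;
  copy
```

A deep clone shares no nodes with the original, so modifying one tree leaves the other untouched.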

Example:

  let original = create_element "div" ~attrs:[("class", "box")] ()
  let shallow = clone original
  let deep = clone ~deep:true original

Serialization

val to_html : ?pretty:bool -> ?indent_size:int -> ?indent:int -> node -> string

to_html ?pretty ?indent_size ?indent node converts a DOM node to an HTML string.

  • parameter pretty

    If true (default), format with indentation and newlines

  • parameter indent_size

    Number of spaces per indentation level (default: 2)

  • parameter indent

    Starting indentation level (default: 0)

  • returns

    The HTML string representation of the node

val to_writer : ?pretty:bool -> ?indent_size:int -> ?indent:int -> Bytesrw.Bytes.Writer.t -> node -> unit

to_writer ?pretty ?indent_size ?indent writer node streams a DOM node as HTML to a bytes writer.

This is more memory-efficient than to_html for large documents as it doesn't build intermediate strings.

  • parameter pretty

    If true (default), format with indentation and newlines

  • parameter indent_size

    Number of spaces per indentation level (default: 2)

  • parameter indent

    Starting indentation level (default: 0)

  • parameter writer

    The bytes writer to output to

val to_test_format : ?indent:int -> node -> string

to_test_format ?indent node converts a DOM node to the html5lib test format.

This format is used by the html5lib test suite for comparing parser output. It represents the DOM tree in a human-readable, line-based format.

  • parameter indent

    Starting indentation level (default: 0)

  • returns

    The test format string representation

val to_text : ?separator:string -> ?strip:bool -> node -> string

to_text ?separator ?strip node extracts all text content from a node.

Recursively collects text from all descendant text nodes.

  • parameter separator

    String to insert between text nodes (default: " ")

  • parameter strip

    If true (default), trim whitespace from result

  • returns

    The concatenated text content
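The effect of the two optional parameters can be sketched on a list of already-collected text pieces. `join_text` is a hypothetical helper, not part of this module, and whether the library skips whitespace-only nodes or trims each piece individually is not stated here, so the sketch only trims the final result as the parameter description says.

```ocaml
(* Join text pieces with [separator], then optionally trim the result. *)
let join_text ?(separator = " ") ?(strip = true) (pieces : string list) =
  let joined = String.concat separator pieces in
  if strip then String.trim joined else joined
```

With the defaults, `join_text [ "Hello"; "world " ]` gives "Hello world"; with `~separator:"" ~strip:false`, the pieces are concatenated exactly as they are.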