Dom (html5rw.Html5rw.Dom)

What is the DOM?

When an HTML parser processes markup like <p>Hello <b>world</b></p>, it doesn't store the text directly. Instead, it builds a tree structure in memory:

Document
└── html
    └── body
        └── p
            ├── #text "Hello "
            └── b
                └── #text "world"

This tree is the DOM. Each entry in the tree is a node. Programs can traverse and modify this tree to read or change the document.

see https://html.spec.whatwg.org/multipage/dom.html
WHATWG: The elements of HTML (DOM chapter)

Node Types

The HTML5 DOM includes several node types, all represented by the same record type with different field usage:

Element nodes: HTML elements like <div>, <p>, <a href="...">. Elements are the building blocks of HTML documents. They can have attributes and contain other nodes.

Text nodes: The actual text content within elements. For example, in <p>Hello</p>, "Hello" is a text node that is a child of the <p> element.

Comment nodes: HTML comments written as <!-- text -->. Comments are preserved in the DOM but not rendered.

Document nodes: The root of the entire document tree. Every HTML document has exactly one Document node at the top.

Document fragment nodes: Lightweight containers that hold a collection of nodes without a parent. Used for efficient batch DOM operations and <template> element contents.

Doctype nodes: The <!DOCTYPE html> declaration at the start of HTML5 documents. This declaration tells browsers to render the page in standards mode.

see https://html.spec.whatwg.org/multipage/dom.html#kinds-of-content
WHATWG: Kinds of content

Namespaces

HTML5 can embed content from other XML vocabularies. Elements belong to one of three namespaces:

HTML namespace (None or implicit): Standard HTML elements like <div>, <p>, <table>. This is the default for all elements.

SVG namespace (Some "svg"): Scalable Vector Graphics for drawing. When the parser encounters an <svg> tag, all elements inside it (like <rect>, <circle>, <path>) are placed in the SVG namespace.

MathML namespace (Some "mathml"): Mathematical Markup Language for equations. When the parser encounters a <math> tag, elements inside it are placed in the MathML namespace.

The parser automatically switches namespaces when entering and leaving these foreign content islands.

see https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-inforeign
WHATWG: Parsing foreign content

Tree Structure

Nodes form a bidirectional tree: each node has a list of children and an optional parent reference. Modification functions in this module maintain these references automatically.

The tree is always well-formed: a node can only have one parent, and circular references are not possible.
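
To illustrate what maintaining both directions of the link involves, here is a minimal sketch over a local stand-in record. The reduced node type and the helpers make, detach and append are hypothetical and are not this module's implementation; in real code, use the tree manipulation functions documented in this module.

```ocaml
(* Local stand-in for the documented record, reduced to the fields used here. *)
type node = {
  name : string;
  mutable children : node list;
  mutable parent : node option;
}

(* Hypothetical constructor for a detached node. *)
let make name = { name; children = []; parent = None }

(* Detach [child] from its current parent, if any. Physical inequality (!=)
   is used because structural comparison would follow the parent links. *)
let detach child =
  match child.parent with
  | None -> ()
  | Some p ->
    p.children <- List.filter (fun c -> c != child) p.children;
    child.parent <- None

(* Append [child] to [parent], updating both directions of the link. *)
let append parent child =
  detach child;
  parent.children <- parent.children @ [ child ];
  child.parent <- Some parent

let () =
  let a = make "div" and b = make "section" and p = make "p" in
  append a p;
  append b p; (* re-parenting removes p from a first *)
  assert (a.children = []);
  assert (match p.parent with Some q -> q == b | None -> false)
```

Detaching before attaching is what keeps the single-parent property: a node never appears in two children lists at once.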

Types

type doctype_data = {

name : string option;
(*
The DOCTYPE name, e.g., "html"
*)
public_id : string option;
(*
Public identifier (legacy, rarely used)
*)
system_id : string option;
(*
System identifier (legacy, rarely used)
*)

}

Information associated with a DOCTYPE node.

The document type declaration (DOCTYPE) was originally meant to tell browsers which version of HTML a document uses. In HTML5, the standard declaration is simply:

<!DOCTYPE html>

This minimal DOCTYPE triggers standards mode (no quirks). The DOCTYPE can optionally include a public identifier and system identifier for legacy compatibility with SGML-based tools, but these are rarely used in modern HTML5 documents.

Historical context: In HTML4 and XHTML, DOCTYPEs were verbose and referenced DTD files. For example:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
 "http://www.w3.org/TR/html4/strict.dtd">

HTML5 simplified this to just <!DOCTYPE html> because:

- Browsers never actually fetched or validated against DTDs
- The DOCTYPE's only real purpose is triggering standards mode
- A minimal DOCTYPE achieves this goal

Field meanings:

- name: The document type name, almost always "html" for HTML documents
- public_id: A public identifier (legacy); None for HTML5
- system_id: A system identifier/URL (legacy); None for HTML5
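
As an illustration of how the three fields map back to declaration syntax, the following sketch renders a record as a DOCTYPE string. The record is redeclared locally so the snippet stands alone, and doctype_to_string is a hypothetical helper; the serializers in this module may format DOCTYPEs differently.

```ocaml
(* Local copy of the documented record type. *)
type doctype_data = {
  name : string option;
  public_id : string option;
  system_id : string option;
}

(* Hypothetical helper: render the record in declaration syntax. *)
let doctype_to_string d =
  let name = Option.value d.name ~default:"" in
  let ids =
    match d.public_id, d.system_id with
    | None, None -> ""
    | Some p, None -> Printf.sprintf " PUBLIC \"%s\"" p
    | Some p, Some s -> Printf.sprintf " PUBLIC \"%s\" \"%s\"" p s
    | None, Some s -> Printf.sprintf " SYSTEM \"%s\"" s
  in
  Printf.sprintf "<!DOCTYPE %s%s>" name ids

let () =
  let html5 = { name = Some "html"; public_id = None; system_id = None } in
  assert (doctype_to_string html5 = "<!DOCTYPE html>")
```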

see https://html.spec.whatwg.org/multipage/syntax.html#the-doctype
WHATWG: The DOCTYPE

see https://html.spec.whatwg.org/multipage/parsing.html#the-initial-insertion-mode
WHATWG: DOCTYPE handling during parsing

type quirks_mode =

| No_quirks
| Quirks
| Limited_quirks

Quirks mode setting for the document.

Quirks mode is a browser rendering mode that emulates bugs and non-standard behaviors from older browsers (primarily Internet Explorer 5). Modern HTML5 documents should always render in standards mode (no quirks) for consistent, predictable behavior.

The HTML5 parser determines quirks mode based on the DOCTYPE declaration:

No_quirks (Standards mode): The document renders according to modern HTML5 and CSS specifications. This is triggered by <!DOCTYPE html>. CSS box model, table layout, and other features work as specified.

Quirks (Full quirks mode): The document renders with legacy browser bugs emulated. This happens when:
- DOCTYPE is missing entirely
- DOCTYPE has certain legacy public identifiers
- DOCTYPE has the wrong format

In quirks mode, many CSS properties behave differently:

- Tables don't inherit font properties
- Box model uses non-standard width calculations
- Certain CSS selectors don't work correctly

Limited_quirks (Almost standards mode): A middle ground that applies only a few specific quirks, primarily affecting table cell vertical sizing. Triggered by XHTML DOCTYPEs and certain HTML4 DOCTYPEs.

Recommendation: Always use <!DOCTYPE html> at the start of HTML5 documents to ensure No_quirks mode.
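
The DOCTYPE-based decision described above can be sketched as a small classifier. This is a heavily abridged illustration, not this library's parser: the types are redeclared locally, has_prefix and classify are hypothetical names, and only a handful of the legacy public identifiers listed in the HTML specification are checked.

```ocaml
type doctype_data = {
  name : string option;
  public_id : string option;
  system_id : string option;
}

type quirks_mode = No_quirks | Quirks | Limited_quirks

(* True if [s] starts with [prefix], compared ASCII case-insensitively. *)
let has_prefix ~prefix s =
  let n = String.length prefix in
  String.length s >= n
  && String.lowercase_ascii (String.sub s 0 n) = String.lowercase_ascii prefix

(* Abridged classifier: the specification lists many more legacy identifiers
   that trigger full quirks mode. *)
let classify (doctype : doctype_data option) =
  match doctype with
  | None -> Quirks (* DOCTYPE missing entirely *)
  | Some { name; public_id; system_id } ->
    let pub = Option.value public_id ~default:"" in
    let html401_loose =
      has_prefix ~prefix:"-//W3C//DTD HTML 4.01 Transitional//" pub
      || has_prefix ~prefix:"-//W3C//DTD HTML 4.01 Frameset//" pub
    in
    let xhtml10_loose =
      has_prefix ~prefix:"-//W3C//DTD XHTML 1.0 Transitional//" pub
      || has_prefix ~prefix:"-//W3C//DTD XHTML 1.0 Frameset//" pub
    in
    if Option.map String.lowercase_ascii name <> Some "html" then Quirks
    else if html401_loose && system_id = None then Quirks
    else if html401_loose || xhtml10_loose then Limited_quirks
    else No_quirks

let () =
  let html5 = { name = Some "html"; public_id = None; system_id = None } in
  assert (classify (Some html5) = No_quirks);
  assert (classify None = Quirks)
```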

see https://quirks.spec.whatwg.org/
Quirks Mode Standard - detailed specification

see https://html.spec.whatwg.org/multipage/parsing.html#the-initial-insertion-mode
WHATWG: How the parser determines quirks mode

type node = {

mutable name : string;
(*
Tag name for elements, or special name for other node types.
For elements, this is the lowercase tag name (e.g., "div", "span"). For other node types, use the constants document_name, text_name, comment_name, etc.
*)
mutable namespace : string option;
(*
Element namespace: None for HTML, Some "svg", Some "mathml".
Most elements are in the HTML namespace (None). The SVG and MathML namespaces are only used when content appears inside <svg> or <math> elements respectively.
- see https://html.spec.whatwg.org/multipage/dom.html#elements-in-the-dom
  WHATWG: Elements in the DOM
*)
mutable attrs : (string * string) list;
(*
Element attributes as (name, value) pairs.
Attributes provide additional information about elements. Common global attributes include:
- id: Unique identifier for the element
- class: Space-separated list of CSS class names
- style: Inline CSS styles
- title: Advisory text (shown as tooltip)
- lang: Language of the element's content
- hidden: Whether the element should be hidden
Element-specific attributes include:
- href on <a>: The link destination URL
- src on <img>: The image source URL
- type on <input>: The input control type
- disabled on form controls: Whether the control is disabled
In HTML5, attribute names are case-insensitive and are normalized to lowercase by the parser.
- see https://html.spec.whatwg.org/multipage/dom.html#global-attributes
  WHATWG: Global attributes
- see https://html.spec.whatwg.org/multipage/indices.html#attributes-3
  WHATWG: Index of attributes
*)
mutable children : node list;
(*
Child nodes in document order.
For most elements, this list contains the nested elements and text. For void elements (like <br>, <img>), this is always empty. For <template> elements, the actual content is in template_content, not here.
*)
mutable parent : node option;
(*
Parent node, None for root nodes.
In a parsed document, every node except the Document node has a parent; newly created nodes have no parent until they are attached. This back-reference enables traversing up the tree.
*)
mutable data : string;
(*
Text content for text and comment nodes.
For text nodes, this contains the actual text. For comment nodes, this contains the comment text (without the <!-- and --> delimiters). For other node types, this field is empty.
*)
mutable template_content : node option;
(*
Document fragment for <template> element contents.
The <template> element holds "inert" content that is not rendered but can be cloned and inserted elsewhere. This field contains a document fragment with the template's content.
For non-template elements, this is None.
- see https://html.spec.whatwg.org/multipage/scripting.html#the-template-element
  WHATWG: The template element
*)
mutable doctype : doctype_data option;
(*
DOCTYPE information for doctype nodes.
Only doctype nodes use this field; for all other nodes it is None.
*)

}

A DOM node in the parsed document tree.

All node types use the same record structure. The name field determines the node type:

- Element: the tag name (e.g., "div", "p", "span")
- Text: "#text"
- Comment: "#comment"
- Document: "#document"
- Document fragment: "#document-fragment"
- Doctype: "!doctype"

Understanding Node Fields

Different node types use different combinations of fields:

Node Type         | name                 | namespace | attrs | data | template_content | doctype
------------------|----------------------|-----------|-------|------|------------------|--------
Element           | tag name             | Yes       | Yes   | No   | If <template>    | No
Text              | "#text"              | No        | No    | Yes  | No               | No
Comment           | "#comment"           | No        | No    | Yes  | No               | No
Document          | "#document"          | No        | No    | No   | No               | No
Document Fragment | "#document-fragment" | No        | No    | No   | No               | No
Doctype           | "!doctype"           | No        | No    | No   | No               | Yes

Element Tag Names

For element nodes, the name field contains the lowercase tag name. HTML5 defines many elements with specific meanings:

Structural elements: html, head, body, header, footer, main, nav, article, section, aside

Text content: p, div, span, h1-h6, pre, blockquote

Lists: ul, ol, li, dl, dt, dd

Tables: table, tr, td, th, thead, tbody, tfoot

Forms: form, input, button, select, textarea, label

Media: img, audio, video, canvas, svg

see https://html.spec.whatwg.org/multipage/indices.html#elements-3
WHATWG: Index of HTML elements

Void Elements

Some elements are void elements - they cannot have children and have no end tag. These include: area, base, br, col, embed, hr, img, input, link, meta, source, track, wbr.
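
The list above translates directly into a membership test. The helper name is_void is hypothetical and is not part of this module; it assumes the argument is the lowercase tag name of an element in the HTML namespace.

```ocaml
(* The void elements named above. *)
let void_elements =
  [ "area"; "base"; "br"; "col"; "embed"; "hr"; "img"; "input";
    "link"; "meta"; "source"; "track"; "wbr" ]

(* Hypothetical helper: true if [name] is the tag name of a void element. *)
let is_void name = List.mem name void_elements

let () =
  assert (is_void "br");
  assert (not (is_void "div"))
```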

see https://html.spec.whatwg.org/multipage/syntax.html#void-elements
WHATWG: Void elements

The Template Element

The <template> element is special: its children are not rendered directly but stored in a separate document fragment accessible via the template_content field. Templates are used for client-side templating where content is cloned and inserted via JavaScript.

see https://html.spec.whatwg.org/multipage/scripting.html#the-template-element
WHATWG: The template element

Node Name Constants

These constants identify special node types. Compare with node.name to determine the node type.

val document_name : string

"#document" - name for document nodes.

The Document node is the root of every HTML document tree. It represents the entire document and is the parent of the <html> element.

see https://html.spec.whatwg.org/multipage/dom.html#document
WHATWG: The Document object

val document_fragment_name : string

"#document-fragment" - name for document fragment nodes.

Document fragments are lightweight container nodes used to hold a collection of nodes without a parent document. They are used:

- To hold <template> element contents
- As results of fragment parsing (innerHTML)
- For efficient batch DOM operations

see https://dom.spec.whatwg.org/#documentfragment
DOM Standard: DocumentFragment

val text_name : string

"#text" - name for text nodes.

Text nodes contain the character data within elements. When the parser encounters text between tags like <p>Hello world</p>, it creates a text node with data "Hello world" as a child of the <p> element.

Adjacent text nodes are automatically merged by the parser.

val comment_name : string

"#comment" - name for comment nodes.

Comment nodes represent HTML comments: <!-- text -->. Comments are preserved in the DOM but not rendered to users. They're useful for development notes or conditional content.

val doctype_name : string

"!doctype" - name for doctype nodes.

The DOCTYPE node represents the <!DOCTYPE html> declaration. If present, it is normally the first child of the Document node, although comments that precede the DOCTYPE in the source appear before it.

see https://html.spec.whatwg.org/multipage/syntax.html#the-doctype
WHATWG: The DOCTYPE

Constructors

Functions to create new DOM nodes. All nodes start with no parent and no children. Use append_child or insert_before to build a tree.

val create_element : 
  string ->
  ?namespace:string option ->
  ?attrs:(string * string) list ->
  unit ->
  node

Create an element node.

Elements are the primary building blocks of HTML documents. Each element represents a component of the document with semantic meaning.

parameter name
The tag name (e.g., "div", "p", "span"). Tag names are case-insensitive in HTML; by convention, use lowercase.

parameter namespace
Element namespace:
- None (default): HTML namespace for standard elements
- Some "svg": SVG namespace for graphics elements
- Some "mathml": MathML namespace for mathematical notation

parameter attrs
Initial attributes as (name, value) pairs

Examples:

  (* Simple HTML element *)
  let div = create_element "div" ()

  (* Element with attributes *)
  let link = create_element "a"
    ~attrs:[("href", "https://example.com"); ("class", "external")]
    ()

  (* SVG element *)
  let rect = create_element "rect"
    ~namespace:(Some "svg")
    ~attrs:[("width", "100"); ("height", "50"); ("fill", "blue")]
    ()

see https://html.spec.whatwg.org/multipage/dom.html#elements-in-the-dom
WHATWG: Elements in the DOM

val create_text : string -> node

Create a text node with the given content.

Text nodes contain the readable content of HTML documents. They appear as children of elements and represent the characters that users see.

Note: Text content is stored as-is. Character references like &amp; should already be decoded to their character values.

Example:

  let text = create_text "Hello, world!" in
  (* To put text in a paragraph: *)
  let p = create_element "p" () in
  append_child p text

val create_comment : string -> node

Create a comment node with the given content.

Comments are human-readable notes in HTML that don't appear in the rendered output. They're written as <!-- text --> in HTML.

parameter data
The comment text (without the <!-- and --> delimiters)

Example:

  let comment = create_comment " TODO: Add navigation "
  (* Represents: <!-- TODO: Add navigation --> *)

see https://html.spec.whatwg.org/multipage/syntax.html#comments
WHATWG: HTML comments

val create_document : unit -> node

Create an empty document node.

The Document node is the root of an HTML document tree. It represents the entire document and serves as the parent for the DOCTYPE (if any) and the root <html> element.

In a complete HTML document, the structure is:

#document
├── !doctype
└── html
    ├── head
    └── body

see https://html.spec.whatwg.org/multipage/dom.html#document
WHATWG: The Document object

val create_document_fragment : unit -> node

Create an empty document fragment.

Document fragments are lightweight containers that can hold multiple nodes without being part of the main document tree. They're useful for:

Template contents: The <template> element stores its children in a document fragment, keeping them inert until cloned

Fragment parsing: When parsing HTML fragments (like innerHTML), the result is placed in a document fragment

Batch operations: Build a subtree in a fragment, then insert it into the document in one operation for better performance

see https://dom.spec.whatwg.org/#documentfragment
DOM Standard: DocumentFragment

val create_doctype : 
  ?name:string ->
  ?public_id:string ->
  ?system_id:string ->
  unit ->
  node

Create a DOCTYPE node.

The DOCTYPE declaration tells browsers to use standards mode for rendering. For HTML5 documents, use:

  let doctype = create_doctype ~name:"html" ()
  (* Represents: <!DOCTYPE html> *)

parameter name
DOCTYPE name (usually "html" for HTML documents)

parameter public_id
Public identifier (legacy, rarely needed)

parameter system_id
System identifier (legacy, rarely needed)

Legacy example:

  (* HTML 4.01 Strict DOCTYPE - not recommended for new documents *)
  let legacy = create_doctype
    ~name:"HTML"
    ~public_id:"-//W3C//DTD HTML 4.01//EN"
    ~system_id:"http://www.w3.org/TR/html4/strict.dtd"
    ()

see https://html.spec.whatwg.org/multipage/syntax.html#the-doctype
WHATWG: The DOCTYPE

val create_template : 
  ?namespace:string option ->
  ?attrs:(string * string) list ->
  unit ->
  node

Create a <template> element with its content document fragment.

The <template> element holds inert HTML content that is not rendered directly. The content is stored in a separate document fragment and can be:

- Cloned and inserted into the document via JavaScript
- Used as a stamping template for repeated content
- Pre-parsed without affecting the page

How templates work:

Unlike normal elements, a <template>'s children are not rendered. Instead, they're stored in the template_content field. This means:

- Images inside won't load
- Scripts inside won't execute
- The content is "inert" until explicitly activated

Example:

  let template = create_template () in
  let div = create_element "div" () in
  let text = create_text "Template content" in
  append_child div text;
  (* Add to template's content fragment, not children *)
  match template.template_content with
  | Some fragment -> append_child fragment div
  | None -> ()

see https://html.spec.whatwg.org/multipage/scripting.html#the-template-element
WHATWG: The template element

Node Type Predicates

Functions to test what type of node you have. Since all nodes use the same record type, these predicates check the name field to determine the actual node type.

val is_element : node -> bool

is_element node returns true if the node is an element node.

Elements are HTML tags like <div>, <p>, <a>. They are identified by having a tag name that doesn't match any of the special node name constants.
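
The name-based dispatch behind these predicates can be sketched as a single classification function. The kind type and kind_of_name are hypothetical and not part of this module; the string literals are the values documented for the node name constants.

```ocaml
type kind = Element | Text | Comment | Document | Document_fragment | Doctype

(* Hypothetical helper: map a node's [name] field to its kind. Any name
   that is not one of the special constants is treated as an element. *)
let kind_of_name = function
  | "#text" -> Text
  | "#comment" -> Comment
  | "#document" -> Document
  | "#document-fragment" -> Document_fragment
  | "!doctype" -> Doctype
  | _ -> Element

let () =
  assert (kind_of_name "div" = Element);
  assert (kind_of_name "#text" = Text)
```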

val is_text : node -> bool

is_text node returns true if the node is a text node.

Text nodes contain the character content within elements. They have name = "#text".

val is_comment : node -> bool

is_comment node returns true if the node is a comment node.

Comment nodes represent HTML comments (<!-- text -->). They have name = "#comment".

val is_document : node -> bool

is_document node returns true if the node is a document node.

The document node is the root of the DOM tree. It has name = "#document".

val is_document_fragment : node -> bool

is_document_fragment node returns true if the node is a document fragment.

Document fragments are lightweight containers. They have name = "#document-fragment".

val is_doctype : node -> bool

is_doctype node returns true if the node is a DOCTYPE node.

DOCTYPE nodes represent the <!DOCTYPE> declaration. They have name = "!doctype".

val has_children : node -> bool

has_children node returns true if the node has any children.

Note: For <template> elements, this checks the direct children list, not the template content fragment.

Tree Manipulation

Functions to modify the DOM tree structure. These functions automatically maintain parent/child references, ensuring the tree remains consistent.

val append_child : node -> node -> unit

append_child parent child adds child as the last child of parent.

The child's parent reference is updated to point to parent. If the child already has a parent, it is first removed from that parent.

Example:

  let body = create_element "body" () in
  let p = create_element "p" () in
  let text = create_text "Hello!" in
  append_child p text;
  append_child body p
  (* Result:
     body
     └── p
         └── #text "Hello!"
  *)

val insert_before : node -> node -> node -> unit

insert_before parent new_child ref_child inserts new_child before ref_child in parent's children.

parameter parent
The parent node

parameter new_child
The node to insert

parameter ref_child
The existing child to insert before

Raises Not_found if ref_child is not a child of parent.

Example:

  let ul = create_element "ul" () in
  let li1 = create_element "li" () in
  let li3 = create_element "li" () in
  append_child ul li1;
  append_child ul li3;
  let li2 = create_element "li" () in
  insert_before ul li2 li3
  (* Result: ul contains li1, li2, li3 in that order *)

val remove_child : node -> node -> unit

remove_child parent child removes child from parent's children.

The child's parent reference is set to None.

Raises Not_found if child is not a child of parent.

val insert_text_at : node -> string -> node option -> unit

insert_text_at parent text before_node inserts text content.

If before_node is None, appends at the end. If the previous sibling is a text node, the text is merged into it (text nodes are coalesced). Otherwise, a new text node is created.

This implements the HTML5 parser's text insertion algorithm which ensures adjacent text nodes are always merged, matching browser behavior.
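
The coalescing rule can be sketched for the simplest case, appending at the end (before_node = None). The reduced record and the helpers text and append_text are hypothetical stand-ins, not this module's implementation.

```ocaml
(* Local stand-in for the documented record, reduced to the fields used here. *)
type node = {
  name : string;
  mutable data : string;
  mutable children : node list;
  mutable parent : node option;
}

(* Hypothetical constructor for a detached text node. *)
let text s = { name = "#text"; data = s; children = []; parent = None }

(* Append [s] to [parent]: merge into a trailing text node if there is one,
   otherwise add a new text node as the last child. *)
let append_text parent s =
  match List.rev parent.children with
  | last :: _ when last.name = "#text" -> last.data <- last.data ^ s
  | _ ->
    let t = text s in
    t.parent <- Some parent;
    parent.children <- parent.children @ [ t ]

let () =
  let p = { name = "p"; data = ""; children = []; parent = None } in
  append_text p "Hello ";
  append_text p "world";
  (* Both insertions end up in a single text node. *)
  assert (List.length p.children = 1);
  assert ((List.hd p.children).data = "Hello world")
```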

see https://html.spec.whatwg.org/multipage/parsing.html#appropriate-place-for-inserting-a-node
WHATWG: Inserting text in the DOM

Attribute Operations

Functions to read and modify element attributes. Attributes are name-value pairs that provide additional information about elements.

In HTML5, attribute names are case-insensitive and normalized to lowercase by the parser.

see https://html.spec.whatwg.org/multipage/dom.html#attributes
WHATWG: Attributes

val get_attr : node -> string -> string option

get_attr node name returns the value of attribute name, or None if the attribute doesn't exist.

Attribute lookup is case-sensitive on the stored (lowercase) names.

val set_attr : node -> string -> string -> unit

set_attr node name value sets attribute name to value.

If the attribute already exists, it is replaced. If it doesn't exist, it is added.

val has_attr : node -> string -> bool

has_attr node name returns true if the node has attribute name.
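
Because attrs is an association list, the three operations correspond closely to standard List functions. The sketch below works on a bare (string * string) list rather than on a node, and the names lookup, mem and put are hypothetical. Whether the real set_attr keeps an existing attribute's position is an assumption made here.

```ocaml
(* Hypothetical counterparts of get_attr and has_attr on a bare list. *)
let lookup attrs name = List.assoc_opt name attrs
let mem attrs name = List.mem_assoc name attrs

(* Replace the value if [name] exists, otherwise add the pair at the end.
   Keeping the original position is an assumption of this sketch. *)
let put attrs name value =
  if List.mem_assoc name attrs then
    List.map (fun (k, v) -> if k = name then (k, value) else (k, v)) attrs
  else attrs @ [ (name, value) ]

let () =
  let attrs = [ ("href", "https://example.com") ] in
  assert (lookup attrs "href" = Some "https://example.com");
  (* Lookup is case-sensitive, so an uppercase name does not match. *)
  assert (lookup attrs "HREF" = None)
```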

Tree Traversal

Functions to navigate the DOM tree.

val descendants : node -> node list

descendants node returns all descendant nodes in document order.

This performs a depth-first traversal, returning children before siblings at each level. The node itself is not included.

Document order is the order nodes appear in the HTML source: parent before children, earlier siblings before later ones.

Example:

  (* For tree: div > (p > "hello", span > "world") *)
  descendants div
  (* Returns: [p; text("hello"); span; text("world")] *)
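
The traversal order can be made concrete with a short sketch over a reduced stand-in record. This is an illustration of pre-order depth-first traversal, not this module's implementation; it uses List.concat_map, which requires OCaml 4.10 or later.

```ocaml
(* Local stand-in for the documented record, reduced to the fields used here. *)
type node = { name : string; children : node list }

(* Pre-order walk: each child, then that child's whole subtree, before
   moving on to the next sibling. The starting node is not included. *)
let rec descendants node =
  List.concat_map (fun c -> c :: descendants c) node.children

let () =
  let text = { name = "#text"; children = [] } in
  let p = { name = "p"; children = [ text ] } in
  let span = { name = "span"; children = [ text ] } in
  let div = { name = "div"; children = [ p; span ] } in
  let names = List.map (fun n -> n.name) (descendants div) in
  assert (names = [ "p"; "#text"; "span"; "#text" ])
```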

val ancestors : node -> node list

ancestors node returns all ancestor nodes from parent to root.

The first element is the immediate parent, the last is the root (usually the Document node).

Example:

  (* For a text node inside: html > body > p > text *)
  ancestors text_node
  (* Returns: [p; body; html; #document] *)
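
Walking up the tree amounts to following parent links until a node without a parent is reached. A minimal sketch over a reduced stand-in record (not this module's implementation):

```ocaml
(* Local stand-in for the documented record, reduced to the fields used here. *)
type node = { name : string; parent : node option }

(* Follow parent links upwards; the nearest ancestor comes first. *)
let rec ancestors node =
  match node.parent with
  | None -> []
  | Some p -> p :: ancestors p

let () =
  let document = { name = "#document"; parent = None } in
  let html = { name = "html"; parent = Some document } in
  let body = { name = "body"; parent = Some html } in
  let p = { name = "p"; parent = Some body } in
  let text = { name = "#text"; parent = Some p } in
  let names = List.map (fun n -> n.name) (ancestors text) in
  assert (names = [ "p"; "body"; "html"; "#document" ])
```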

val get_text_content : node -> string

get_text_content node returns the concatenated text content.

For text nodes, returns the text data directly. For elements, recursively concatenates all descendant text content. For other node types, returns an empty string.

Example:

  (* For: <p>Hello <b>world</b>!</p> *)
  get_text_content p_element
  (* Returns: "Hello world!" *)
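
The three cases described above can be sketched as one recursive function over a reduced stand-in record. The name text_content is hypothetical, and treating every non-special name as an element follows the node-name convention of this module; the real function may differ in details.

```ocaml
(* Local stand-in for the documented record, reduced to the fields used here. *)
type node = { name : string; data : string; children : node list }

let rec text_content node =
  match node.name with
  | "#text" -> node.data
  (* Other non-element node types yield an empty string. *)
  | "#comment" | "#document" | "#document-fragment" | "!doctype" -> ""
  (* Elements: concatenate the text content of all children, recursively. *)
  | _ -> String.concat "" (List.map text_content node.children)

let () =
  let text s = { name = "#text"; data = s; children = [] } in
  let b = { name = "b"; data = ""; children = [ text "world" ] } in
  let p = { name = "p"; data = ""; children = [ text "Hello "; b; text "!" ] } in
  assert (text_content p = "Hello world!")
```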

Cloning

val clone : ?deep:bool -> node -> node

clone ?deep node creates a copy of the node.

parameter deep
If true, recursively clone all descendants (default: false)

The cloned node has no parent. With ~deep:false (the default), only the node itself is copied, including its attributes but not its children.

Example:

  let original = create_element "div" ~attrs:[("class", "box")] () in
  let shallow = clone original in
  let deep = clone ~deep:true original in
  (shallow, deep)
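
The difference between shallow and deep copies can be sketched over a reduced stand-in record. The name clone_sketch is hypothetical, and the sketch ignores the remaining record fields (data, template_content, doctype), so it should not be read as this module's implementation.

```ocaml
(* Local stand-in for the documented record, reduced to the fields used here. *)
type node = {
  name : string;
  mutable attrs : (string * string) list;
  mutable children : node list;
  mutable parent : node option;
}

let rec clone_sketch ?(deep = false) node =
  (* The copy always starts detached and without children. *)
  let copy =
    { name = node.name; attrs = node.attrs; children = []; parent = None }
  in
  if deep then begin
    (* Copy each child recursively and point it at the new parent. *)
    copy.children <-
      List.map
        (fun c ->
          let c' = clone_sketch ~deep:true c in
          c'.parent <- Some copy;
          c')
        node.children
  end;
  copy

let () =
  let child = { name = "p"; attrs = []; children = []; parent = None } in
  let original =
    { name = "div"; attrs = [ ("class", "box") ]; children = [ child ]; parent = None }
  in
  child.parent <- Some original;
  let shallow = clone_sketch original in
  let deep = clone_sketch ~deep:true original in
  assert (shallow.children = []);
  assert (List.length deep.children = 1)
```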

Serialization

val to_html : ?pretty:bool -> ?indent_size:int -> ?indent:int -> node -> string

to_html ?pretty ?indent_size ?indent node converts a DOM node to an HTML string.

parameter pretty
If true (default), format with indentation and newlines

parameter indent_size
Number of spaces per indentation level (default: 2)

parameter indent
Starting indentation level (default: 0)

returns
The HTML string representation of the node

val to_writer : 
  ?pretty:bool ->
  ?indent_size:int ->
  ?indent:int ->
  Bytesrw.Bytes.Writer.t ->
  node ->
  unit

to_writer ?pretty ?indent_size ?indent writer node streams a DOM node as HTML to a bytes writer.

This is more memory-efficient than to_html for large documents as it doesn't build intermediate strings.

parameter pretty
If true (default), format with indentation and newlines

parameter indent_size
Number of spaces per indentation level (default: 2)

parameter indent
Starting indentation level (default: 0)

parameter writer
The bytes writer to output to

val to_test_format : ?indent:int -> node -> string

to_test_format ?indent node converts a DOM node to the html5lib test format.

This format is used by the html5lib test suite for comparing parser output. It represents the DOM tree in a human-readable, line-based format.

parameter indent
Starting indentation level (default: 0)

returns
The test format string representation

val to_text : ?separator:string -> ?strip:bool -> node -> string

to_text ?separator ?strip node extracts all text content from a node.

Recursively collects text from all descendant text nodes.

parameter separator
String to insert between text nodes (default: " ")

parameter strip
If true (default), trim whitespace from result

returns
The concatenated text content