Html5rw.Dom
DOM types and manipulation functions.
This module provides the core types for representing HTML documents as DOM trees. It includes:
The Dom.node type representing all kinds of DOM nodes
HTML5 DOM Types and Operations
This module provides the DOM (Document Object Model) node representation used by the HTML5 parser. The DOM is a programming interface that represents an HTML document as a tree of nodes, where each node represents part of the document (an element, text content, comment, etc.).
When an HTML parser processes markup like <p>Hello <b>world</b></p>, it doesn't store the text directly. Instead, it builds a tree structure in memory:
Document
└── html
    └── body
        └── p
            ├── #text "Hello "
            └── b
                └── #text "world"
This tree is the DOM. Each box in the tree is a node. Programs can traverse and modify this tree to read or change the document.
The HTML5 DOM includes several node types, all represented by the same record type with different field usage:
Element: HTML tags like <div>, <p>, <a href="...">. Elements are the building blocks of HTML documents. They can have attributes and contain other nodes.
Text: In <p>Hello</p>, "Hello" is a text node that is a child of the <p> element.
Comment: <!-- comment text -->. Comments are preserved in the DOM but not rendered.
Document fragment: Used for <template> element contents.
Doctype: The <!DOCTYPE html> declaration at the start of HTML5 documents. This declaration tells browsers to render the page in standards mode.
HTML5 can embed content from other XML vocabularies. Elements belong to one of three namespaces:
HTML (None or implicit): Standard HTML elements like <div>, <p>, <table>. This is the default for all elements.
SVG (Some "svg"): Scalable Vector Graphics for drawing. When the parser encounters an <svg> tag, all elements inside it (like <rect>, <circle>, <path>) are placed in the SVG namespace.
MathML (Some "mathml"): Mathematical Markup Language for equations. When the parser encounters a <math> tag, elements inside it are placed in the MathML namespace.
The parser automatically switches namespaces when entering and leaving these foreign content islands.
Nodes form a bidirectional tree: each node has a list of children and an optional parent reference. Modification functions in this module maintain these references automatically.
The tree is always well-formed: a node can only have one parent, and circular references are not possible.
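As a rough illustration of how both directions of the link can be kept in sync, the sketch below moves a node from one parent to another. It uses a cut-down stand-in record (only name, children, and parent) and the hypothetical helpers make, detach, and append_child_sketch; it is an assumption about how such bookkeeping could work, not the module's actual implementation.

```ocaml
(* Hypothetical stand-in for the node record; only the fields needed here. *)
type node = {
  name : string;
  mutable children : node list;
  mutable parent : node option;
}

let make name = { name; children = []; parent = None }

(* Detach [child] from its current parent, if any. Physical inequality (!=)
   is used so that structurally equal siblings are not confused. *)
let detach child =
  match child.parent with
  | None -> ()
  | Some p ->
    p.children <- List.filter (fun c -> c != child) p.children;
    child.parent <- None

(* Append [child] as the last child of [parent], updating both the
   children list and the parent back-reference. *)
let append_child_sketch parent child =
  detach child;
  parent.children <- parent.children @ [ child ];
  child.parent <- Some parent

let () =
  let a = make "div" and b = make "section" and p = make "p" in
  append_child_sketch a p;
  append_child_sketch b p; (* moves p from a to b *)
  assert (a.children = []);
  assert (List.length b.children = 1);
  assert (match p.parent with Some q -> q == b | None -> false)
```

Detaching before attaching is what guarantees the "one parent only" property: at no point does the node appear in two children lists.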
type doctype_data = {
  name : string option;  (* The DOCTYPE name, e.g., "html" *)
  public_id : string option;  (* Public identifier (legacy, rarely used) *)
  system_id : string option;  (* System identifier (legacy, rarely used) *)
}
Information associated with a DOCTYPE node.
The document type declaration (DOCTYPE) tells browsers what version of HTML the document uses. In HTML5, the standard declaration is simply:
<!DOCTYPE html>
This minimal DOCTYPE triggers standards mode (no quirks). The DOCTYPE can optionally include a public identifier and system identifier for legacy compatibility with SGML-based tools, but these are rarely used in modern HTML5 documents.
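To make the mapping between the three fields and the declaration syntax concrete, here is a minimal sketch. The record mirrors doctype_data as documented above; doctype_to_string is a hypothetical helper introduced only for illustration and is not claimed to be part of this module.

```ocaml
type doctype_data = {
  name : string option;
  public_id : string option;
  system_id : string option;
}

(* Hypothetical helper: render the fields in declaration syntax. *)
let doctype_to_string d =
  let name = Option.value d.name ~default:"" in
  let ids =
    match d.public_id, d.system_id with
    | Some p, Some s -> Printf.sprintf " PUBLIC \"%s\" \"%s\"" p s
    | Some p, None -> Printf.sprintf " PUBLIC \"%s\"" p
    | None, Some s -> Printf.sprintf " SYSTEM \"%s\"" s
    | None, None -> ""
  in
  "<!DOCTYPE " ^ name ^ ids ^ ">"

let () =
  let html5 = { name = Some "html"; public_id = None; system_id = None } in
  assert (doctype_to_string html5 = "<!DOCTYPE html>")
```

With both identifiers set to None, only the name is emitted, which is the HTML5 form.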
Historical context: In HTML4 and XHTML, DOCTYPEs were verbose and referenced DTD files. For example:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
HTML5 simplified this to just <!DOCTYPE html> because:
Field meanings:
name: The document type name, almost always "html" for HTML documents
public_id: A public identifier (legacy); None for HTML5
system_id: A system identifier/URL (legacy); None for HTML5
Quirks mode setting for the document.
Quirks mode is a browser rendering mode that emulates bugs and non-standard behaviors from older browsers (primarily Internet Explorer 5). Modern HTML5 documents should always render in standards mode (no quirks) for consistent, predictable behavior.
The HTML5 parser determines quirks mode based on the DOCTYPE declaration:
No_quirks (standards mode): Triggered by <!DOCTYPE html>. CSS box model, table layout, and other features work as specified.
Quirks (Full quirks mode): The document renders with legacy browser bugs emulated. This happens when:
In quirks mode, many CSS properties behave differently:
Recommendation: Always use <!DOCTYPE html> at the start of HTML5 documents to ensure No_quirks mode.
type node = {
  mutable name : string;
  (* Tag name for elements, or special name for other node types.
     For elements, this is the lowercase tag name (e.g., "div", "span"). For other node types, use the constants document_name, text_name, comment_name, etc. *)
  mutable namespace : string option;
  (* Element namespace: None for HTML, Some "svg", Some "mathml".
     Most elements are in the HTML namespace (None). The SVG and MathML namespaces are only used when content appears inside <svg> or <math> elements respectively. *)
  mutable attrs : (string * string) list;
  (* Element attributes as (name, value) pairs.
     Attributes provide additional information about elements. Common global attributes include:
       id: Unique identifier for the element
       class: Space-separated list of CSS class names
       style: Inline CSS styles
       title: Advisory text (shown as tooltip)
       lang: Language of the element's content
       hidden: Whether the element should be hidden
     Element-specific attributes include:
       href on <a>: The link destination URL
       src on <img>: The image source URL
       type on <input>: The input control type
       disabled on form controls: Whether the control is disabled
     In HTML5, attribute names are case-insensitive and are normalized to lowercase by the parser. *)
  mutable children : node list;
  (* Child nodes in document order.
     For most elements, this list contains the nested elements and text. For void elements (like <br>, <img>), this is always empty. For <template> elements, the actual content is in template_content, not here. *)
  mutable parent : node option;
  (* Parent node, None for root nodes.
     Within a parsed tree, every node except the root Document node has a parent; newly created or detached nodes have None. This back-reference enables traversing up the tree. *)
  mutable data : string;
  (* Text content for text and comment nodes.
     For text nodes, this contains the actual text. For comment nodes, this contains the comment text (without the <!-- and --> delimiters). For other node types, this field is empty. *)
  mutable template_content : node option;
  (* Document fragment for <template> element contents.
     The <template> element holds "inert" content that is not rendered but can be cloned and inserted elsewhere. This field contains a document fragment with the template's content.
     For non-template elements, this is None. *)
  mutable doctype : doctype_data option;
  (* DOCTYPE information for doctype nodes.
     Only doctype nodes use this field; for all other nodes it is None. *)
}
A DOM node in the parsed document tree.
All node types use the same record structure. The name field determines the node type:
Understanding Node Fields
Different node types use different combinations of fields:
Node Type         | name                 | namespace | attrs | data | template_content | doctype
------------------|----------------------|-----------|-------|------|------------------|--------
Element           | tag name             | Yes       | Yes   | No   | If <template>    | No
Text              | "#text"              | No        | No    | Yes  | No               | No
Comment           | "#comment"           | No        | No    | Yes  | No               | No
Document          | "#document"          | No        | No    | No   | No               | No
Document Fragment | "#document-fragment" | No        | No    | No   | No               | No
Doctype           | "!doctype"           | No        | No    | No   | No               | Yes
Element Tag Names
For element nodes, the name field contains the lowercase tag name. HTML5 defines many elements with specific meanings:
Structural elements: html, head, body, header, footer, main, nav, article, section, aside
Text content: p, div, span, h1-h6, pre, blockquote
Lists: ul, ol, li, dl, dt, dd
Tables: table, tr, td, th, thead, tbody, tfoot
Forms: form, input, button, select, textarea, label
Media: img, audio, video, canvas, svg
Void Elements
Some elements are void elements - they cannot have children and have no end tag. These include: area, base, br, col, embed, hr, img, input, link, meta, source, track, wbr.
The Template Element
The <template> element is special: its children are not rendered directly but stored in a separate document fragment accessible via the template_content field. Templates are used for client-side templating where content is cloned and inserted via JavaScript.
These constants identify special node types. Compare with node.name to determine the node type.
"#document" - name for document nodes.
The Document node is the root of every HTML document tree. It represents the entire document and is the parent of the <html> element.
"#document-fragment" - name for document fragment nodes.
Document fragments are lightweight container nodes used to hold a collection of nodes without a parent document. They are used:
<template> element contents
"#text" - name for text nodes.
Text nodes contain the character data within elements. When the parser encounters text between tags like <p>Hello world</p>, it creates a text node with data "Hello world" as a child of the <p> element.
Adjacent text nodes are automatically merged by the parser.
"#comment" - name for comment nodes.
Comment nodes represent HTML comments: <!-- comment text -->. Comments are preserved in the DOM but not rendered to users. They're useful for development notes or conditional content.
"!doctype" - name for doctype nodes.
The DOCTYPE node represents the <!DOCTYPE html> declaration. When present, it is normally the first child of the Document node; only comments can precede it.
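Because every node kind is encoded in the name string, a caller can recover the kind with a simple match on the literal values listed above. The sketch below is only an illustration: the kind variant and kind_of_name are hypothetical names, and the module itself exposes predicates rather than a variant type.

```ocaml
(* Hypothetical variant, introduced only for this illustration. *)
type kind = Element | Text | Comment | Document | Document_fragment | Doctype

(* Map a node's name field to its kind, using the special names above. *)
let kind_of_name = function
  | "#text" -> Text
  | "#comment" -> Comment
  | "#document" -> Document
  | "#document-fragment" -> Document_fragment
  | "!doctype" -> Doctype
  | _ -> Element (* any other name is treated as a tag name *)

let () =
  assert (kind_of_name "#text" = Text);
  assert (kind_of_name "div" = Element)
```

The fallthrough case reflects the rule that an element is any node whose name is not one of the special constants.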
Functions to create new DOM nodes. All nodes start with no parent and no children. Use append_child or insert_before to build a tree.
val create_element :
string ->
?namespace:string option ->
?attrs:(string * string) list ->
unit ->
node
Create an element node.
Elements are the primary building blocks of HTML documents. Each element represents a component of the document with semantic meaning.
Examples:
(* Simple HTML element *)
let div = create_element "div" ()
(* Element with attributes *)
let link = create_element "a"
  ~attrs:[("href", "https://example.com"); ("class", "external")]
  ()
(* SVG element *)
let rect = create_element "rect"
  ~namespace:(Some "svg")
  ~attrs:[("width", "100"); ("height", "50"); ("fill", "blue")]
  ()
val create_text : string -> node
Create a text node with the given content.
Text nodes contain the readable content of HTML documents. They appear as children of elements and represent the characters that users see.
Note: Text content is stored as-is. Character references like &amp; should already be decoded to their character values (here, &).
Example:
let text = create_text "Hello, world!"
(* To put text in a paragraph: *)
let p = create_element "p" ()
let () = append_child p text
val create_comment : string -> node
Create a comment node with the given content.
Comments are human-readable notes in HTML that don't appear in the rendered output. They're written as <!-- comment --> in HTML.
Example:
let comment = create_comment " TODO: Add navigation "
(* Represents: <!-- TODO: Add navigation --> *)
val create_document : unit -> node
Create an empty document node.
The Document node is the root of an HTML document tree. It represents the entire document and serves as the parent for the DOCTYPE (if any) and the root <html> element.
In a complete HTML document, the structure is:
#document
├── !doctype
└── html
    ├── head
    └── body
val create_document_fragment : unit -> node
Create an empty document fragment.
Document fragments are lightweight containers that can hold multiple nodes without being part of the main document tree. They're useful for:
The <template> element stores its children in a document fragment, keeping them inert until cloned
val create_doctype :
?name:string ->
?public_id:string ->
?system_id:string ->
unit ->
node
Create a DOCTYPE node.
The DOCTYPE declaration tells browsers to use standards mode for rendering. For HTML5 documents, use:
let doctype = create_doctype ~name:"html" ()
(* Represents: <!DOCTYPE html> *)
Legacy example:
(* HTML 4.01 Strict DOCTYPE - not recommended for new documents *)
let legacy = create_doctype
  ~name:"HTML"
  ~public_id:"-//W3C//DTD HTML 4.01//EN"
  ~system_id:"http://www.w3.org/TR/html4/strict.dtd"
  ()
val create_template :
?namespace:string option ->
?attrs:(string * string) list ->
unit ->
node
Create a <template> element with its content document fragment.
The <template> element holds inert HTML content that is not rendered directly. The content is stored in a separate document fragment and can be:
How templates work:
Unlike normal elements, a <template>'s children are not rendered. Instead, they're stored in the template_content field. This means:
Example:
let template = create_template () in
let div = create_element "div" () in
let text = create_text "Template content" in
append_child div text;
(* Add to template's content fragment, not children *)
match template.template_content with
| Some fragment -> append_child fragment div
| None -> ()
Functions to test what type of node you have. Since all nodes use the same record type, these predicates check the name field to determine the actual node type.
val is_element : node -> bool
is_element node returns true if the node is an element node.
Elements are HTML tags like <div>, <p>, <a>. They are identified by having a tag name that doesn't match any of the special node name constants.
val is_text : node -> bool
is_text node returns true if the node is a text node.
Text nodes contain the character content within elements. They have name = "#text".
val is_comment : node -> bool
is_comment node returns true if the node is a comment node.
Comment nodes represent HTML comments <!-- ... -->. They have name = "#comment".
val is_document : node -> bool
is_document node returns true if the node is a document node.
The document node is the root of the DOM tree. It has name = "#document".
val is_document_fragment : node -> bool
is_document_fragment node returns true if the node is a document fragment.
Document fragments are lightweight containers. They have name = "#document-fragment".
val is_doctype : node -> bool
is_doctype node returns true if the node is a DOCTYPE node.
DOCTYPE nodes represent the <!DOCTYPE> declaration. They have name = "!doctype".
val has_children : node -> bool
has_children node returns true if the node has any children.
Note: For <template> elements, this checks the direct children list, not the template content fragment.
Functions to modify the DOM tree structure. These functions automatically maintain parent/child references, ensuring the tree remains consistent.
append_child parent child adds child as the last child of parent.
The child's parent reference is updated to point to parent. If the child already has a parent, it is first removed from that parent.
Example:
let body = create_element "body" () in
let p = create_element "p" () in
let text = create_text "Hello!" in
append_child p text;
append_child body p
(* Result:
   body
   └── p
       └── #text "Hello!"
*)
insert_before parent new_child ref_child inserts new_child before ref_child in parent's children.
Raises Not_found if ref_child is not a child of parent.
Example:
let ul = create_element "ul" () in
let li1 = create_element "li" () in
let li3 = create_element "li" () in
append_child ul li1;
append_child ul li3;
let li2 = create_element "li" () in
insert_before ul li2 li3
(* Result: ul contains li1, li2, li3 in that order *)
remove_child parent child removes child from parent's children.
The child's parent reference is set to None.
Raises Not_found if child is not a child of parent.
insert_text_at parent text before_node inserts text content.
If before_node is None, appends at the end. If the previous sibling is a text node, the text is merged into it (text nodes are coalesced). Otherwise, a new text node is created.
This implements the HTML5 parser's text insertion algorithm which ensures adjacent text nodes are always merged, matching browser behavior.
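The coalescing rule can be sketched for the simplest case, appending at the end (the situation where before_node is None). The record below is a reduced stand-in and append_text_sketch is a hypothetical name; this illustrates the merge idea under those assumptions and is not the module's code.

```ocaml
(* Reduced stand-in for the node record; only the fields needed here. *)
type node = {
  name : string;
  mutable data : string;
  mutable children : node list;
}

let text s = { name = "#text"; data = s; children = [] }

(* Append text to [parent]. If the last child is already a text node,
   extend its data instead of creating a second adjacent text node. *)
let append_text_sketch parent s =
  match List.rev parent.children with
  | last :: _ when last.name = "#text" -> last.data <- last.data ^ s
  | _ -> parent.children <- parent.children @ [ text s ]

let () =
  let p = { name = "p"; data = ""; children = [] } in
  append_text_sketch p "Hello, ";
  append_text_sketch p "world";
  assert (List.length p.children = 1);
  assert ((List.hd p.children).data = "Hello, world")
```

Two consecutive insertions leave a single text node, which is the invariant the paragraph above describes.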
Functions to read and modify element attributes. Attributes are name-value pairs that provide additional information about elements.
In HTML5, attribute names are case-insensitive and normalized to lowercase by the parser.
val get_attr : node -> string -> string option
get_attr node name returns the value of attribute name, or None if the attribute doesn't exist.
Attribute lookup is case-sensitive on the stored (lowercase) names.
val set_attr : node -> string -> string -> unit
set_attr node name value sets attribute name to value.
If the attribute already exists, it is replaced. If it doesn't exist, it is added.
val has_attr : node -> string -> bool
has_attr node name returns true if the node has attribute name.
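Since attrs is a plain list of (name, value) pairs, the behavior described above can be approximated with standard association-list functions. The sketch below works on a bare list rather than a node; lookup and update are hypothetical names, and whether the library keeps an attribute's position when replacing it is an assumption made here.

```ocaml
(* Attributes as described above: (name, value) pairs with lowercase names. *)
type attrs = (string * string) list

(* Look up a value; the comparison is case-sensitive. *)
let lookup (attrs : attrs) name = List.assoc_opt name attrs

(* Replace an existing value in place, or append a new pair. Keeping the
   original position on replacement is an assumption of this sketch. *)
let update (attrs : attrs) name value : attrs =
  if List.mem_assoc name attrs then
    List.map (fun (k, v) -> if k = name then (k, value) else (k, v)) attrs
  else attrs @ [ (name, value) ]

let () =
  let a = [ ("href", "https://example.com"); ("class", "external") ] in
  assert (lookup a "href" = Some "https://example.com");
  (* Stored names are lowercase, so an uppercase query finds nothing... *)
  assert (lookup a "HREF" = None);
  (* ...unless the caller lowercases it first. *)
  assert (lookup a (String.lowercase_ascii "HREF") = Some "https://example.com");
  let a = update a "class" "internal" in
  assert (lookup a "class" = Some "internal");
  assert (List.length (update a "id" "main") = 3)
```

The practical consequence of case-sensitive lookup is that callers holding names from untrusted input should lowercase them before querying.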
Functions to navigate the DOM tree.
descendants node returns all descendant nodes in document order.
This performs a depth-first traversal, returning children before siblings at each level. The node itself is not included.
Document order is the order nodes appear in the HTML source: parent before children, earlier siblings before later ones.
Example:
(* For tree: div > (p > "hello", span > "world") *)
descendants div
(* Returns: [p; text("hello"); span; text("world")] *)
ancestors node returns all ancestor nodes from parent to root.
The first element is the immediate parent, the last is the root (usually the Document node).
Example:
(* For a text node inside: html > body > p > text *)
ancestors text_node
(* Returns: [p; body; html; #document] *)
val get_text_content : node -> string
get_text_content node returns the concatenated text content.
For text nodes, returns the text data directly. For elements, recursively concatenates all descendant text content. For other node types, returns an empty string.
Example:
(* For: <p>Hello <b>world</b>!</p> *)
get_text_content p_element
(* Returns: "Hello world!" *)
clone ?deep node creates a copy of the node.
The cloned node has no parent. With deep:false, only the node itself is copied (with its attributes, but not its children).
Example:
let original = create_element "div" ~attrs:[("class", "box")] ()
let shallow = clone original
let deep = clone ~deep:true original
val to_html : ?pretty:bool -> ?indent_size:int -> ?indent:int -> node -> string
to_html ?pretty ?indent_size ?indent node converts a DOM node to an HTML string.
val to_writer :
?pretty:bool ->
?indent_size:int ->
?indent:int ->
Bytesrw.Bytes.Writer.t ->
node ->
unit
to_writer ?pretty ?indent_size ?indent writer node streams a DOM node as HTML to a bytes writer.
This is more memory-efficient than to_html for large documents as it doesn't build intermediate strings.
val to_test_format : ?indent:int -> node -> string
to_test_format ?indent node converts a DOM node to the html5lib test format.
This format is used by the html5lib test suite for comparing parser output. It represents the DOM tree in a human-readable, line-based format.
val to_text : ?separator:string -> ?strip:bool -> node -> string
to_text ?separator ?strip node extracts all text content from a node.
Recursively collects text from all descendant text nodes.
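The recursive collection can be pictured as a document-order walk followed by a filter on text nodes. The sketch below uses an immutable stand-in record and the hypothetical helpers el, txt, descendants_sketch, and text_of; treating the separator as a string placed between consecutive text nodes is an assumption, and the ?strip option is not modelled.

```ocaml
(* Immutable stand-in for the node record; only the fields needed here. *)
type node = { name : string; data : string; children : node list }

let el name children = { name; data = ""; children }
let txt s = { name = "#text"; data = s; children = [] }

(* Pre-order walk: each child is listed before its own descendants and
   before its later siblings. The starting node is not included. *)
let rec descendants_sketch n =
  List.concat_map (fun c -> c :: descendants_sketch c) n.children

(* Join the data of every text node under [n] with [separator]. *)
let text_of ?(separator = "") n =
  descendants_sketch n
  |> List.filter (fun c -> c.name = "#text")
  |> List.map (fun c -> c.data)
  |> String.concat separator

let () =
  (* <p>Hello <b>world</b>!</p> *)
  let p = el "p" [ txt "Hello "; el "b" [ txt "world" ]; txt "!" ] in
  assert (List.map (fun c -> c.name) (descendants_sketch p)
          = [ "#text"; "b"; "#text"; "#text" ]);
  assert (text_of p = "Hello world!")
```

Building the full descendant list first keeps the sketch short; a streaming traversal would avoid the intermediate list for large trees.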