Module Html5rw.Parser

Low-level parser access.

This module exposes the internals of the HTML5 parser for advanced use. Most users should use the top-level parse function instead.

The parser exposes:

HTML5 Parser - Low-Level API

This module provides the core HTML5 parsing functionality implementing the WHATWG HTML5 parsing specification. It handles tokenization, tree construction, error recovery, and produces a DOM tree.

For most uses, prefer the top-level Html5rw module which provides a simpler interface. This module is for advanced use cases that need access to parser internals.

How HTML5 Parsing Works

The HTML5 parsing algorithm is unusual compared to most parsers. It was reverse-engineered from browser behavior rather than designed from a formal grammar. This ensures the parser handles malformed HTML exactly like web browsers do.

The algorithm has three main phases:

1. Encoding Detection

Before parsing begins, the character encoding must be determined. The WHATWG specification defines a "sniffing" algorithm:

  1. Check for a BOM (Byte Order Mark) at the start
  2. Look for <meta charset="..."> in the first 1024 bytes
  3. Use HTTP Content-Type header hint if available
  4. Fall back to UTF-8
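
As an illustration of the first step, BOM detection can be sketched with the standard library alone. The function name sniff_bom and the returned encoding labels are hypothetical and are not part of this module's API; the byte patterns are the standard UTF-8 and UTF-16 BOMs.

```ocaml
(* Hypothetical helper, not part of Html5rw: returns the encoding implied by
   a leading byte order mark, or None when the input has no BOM. *)
let sniff_bom (s : string) : string option =
  let starts_with prefix =
    String.length s >= String.length prefix
    && String.sub s 0 (String.length prefix) = prefix
  in
  if starts_with "\xEF\xBB\xBF" then Some "utf-8"
  else if starts_with "\xFF\xFE" then Some "utf-16le"
  else if starts_with "\xFE\xFF" then Some "utf-16be"
  else None
```

When this returns None, detection would continue with the later steps in the list.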

2. Tokenization

The tokenizer converts the input stream into a sequence of tokens. It implements a state machine with over 80 states to handle:

The tokenizer has special handling for:

3. Tree Construction

The tree builder receives tokens from the tokenizer and builds the DOM tree. It uses insertion modes - a state machine that determines how each token should be processed based on the current document context.

Insertion modes include:

The tree builder maintains:

Error Recovery

A key feature of HTML5 parsing is that it never fails. The specification defines error recovery for every possible malformed input. For example:

This ensures every HTML document produces a valid DOM tree.

The Adoption Agency Algorithm

One of the most complex parts of HTML5 parsing is handling misnested formatting elements. For example:

<p>Hello <b>world</p> <p>more</b> text</p>

Browsers don't just error out - they use the "adoption agency algorithm" to produce sensible results. This algorithm:

  1. Identifies formatting elements that span across other elements
  2. Reconstructs the tree to properly nest elements
  3. Moves nodes between parents as needed

Sub-modules

module Dom = Dom

DOM types and manipulation.

module Tokenizer = Tokenizer

HTML5 tokenizer.

module Encoding = Encoding

Character encoding detection and conversion.

module Constants : sig ... end

HTML element constants and categories.

module Insertion_mode : sig ... end

Parser insertion modes.

module Tree_builder : sig ... end

Tree builder state.

Types

type parse_error

A parse error encountered during parsing.

HTML5 parsing never fails - it always produces a DOM tree. However, the WHATWG specification defines 92 specific error conditions that conformance checkers should report. These errors indicate malformed HTML that browsers will still render (with error recovery).

Error categories:

Tokenizer errors (detected during tokenization):

  • abrupt-closing-of-empty-comment: Empty comment closed abruptly by > (as in <!--> or <!--->)
  • abrupt-doctype-public-identifier: DOCTYPE public ID ended unexpectedly
  • eof-before-tag-name: End of file while reading a tag name
  • eof-in-tag: End of file inside a tag
  • missing-attribute-value: Attribute has = but no value
  • unexpected-null-character: Null byte in the input
  • unexpected-question-mark-instead-of-tag-name: <? found where a tag name was expected (for example, an XML processing instruction)

Tree construction errors (detected during tree building):

  • missing-doctype: No DOCTYPE before first element
  • unexpected-token-*: Token appeared in wrong context
  • foster-parenting: Content moved outside table due to invalid position

Enable error collection with ~collect_errors:true. Error collection has some performance overhead, so it's disabled by default.

val error_code : parse_error -> string

Get the error code string.

Error codes are lowercase with hyphens, exactly matching the WHATWG specification naming. Examples:

  • "unexpected-null-character"
  • "eof-before-tag-name"
  • "missing-end-tag-name"
  • "duplicate-attribute"
  • "missing-doctype"

val error_line : parse_error -> int

Get the line number where the error occurred.

Line numbers are 1-indexed (first line is 1). Line breaks are detected at LF (U+000A), CR (U+000D), and CR+LF sequences.

val error_column : parse_error -> int

Get the column number where the error occurred.

Column numbers are 1-indexed (first column is 1). Columns reset to 1 after each line break. Column counting uses code points, not bytes or grapheme clusters.
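
The line and column conventions described for error_line and error_column can be sketched as a standalone function. This is only an illustration of the counting rules, assuming UTF-8 input and a byte offset; line_col is a hypothetical name and is not how the library necessarily tracks positions.

```ocaml
(* Hypothetical helper: 1-indexed (line, column) of byte offset [pos] in a
   UTF-8 string. LF, CR and CR+LF each count as one line break. Columns count
   code points, so UTF-8 continuation bytes (10xxxxxx) do not advance. *)
let line_col (s : string) (pos : int) : int * int =
  let line = ref 1 and col = ref 1 in
  let stop = min pos (String.length s) in
  let i = ref 0 in
  while !i < stop do
    (match s.[!i] with
     | '\r' ->
       (* Treat CR+LF as a single break by consuming the LF as well. *)
       if !i + 1 < String.length s && s.[!i + 1] = '\n' then incr i;
       incr line;
       col := 1
     | '\n' ->
       incr line;
       col := 1
     | c when Char.code c land 0xC0 = 0x80 -> () (* continuation byte *)
     | _ -> incr col);
    incr i
  done;
  (!line, !col)
```

For instance, in "ab\ncd" the byte at offset 4 (the letter d) is on line 2, column 2.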

type fragment_context

Context element for HTML fragment parsing.

When parsing HTML fragments (the content that would be assigned to an element's innerHTML), the parser needs to know what element would contain the fragment. This affects parsing in several ways:

Parser state initialization:

  • For <title> or <textarea>: Tokenizer starts in RCDATA state
  • For <style>, <xmp>, <iframe>, <noembed>, <noframes>: Tokenizer starts in RAWTEXT state
  • For <script>: Tokenizer starts in script data state
  • For <noscript>: Tokenizer starts in RAWTEXT state (if scripting enabled)
  • For <plaintext>: Tokenizer starts in PLAINTEXT state
  • Otherwise: Tokenizer starts in data state
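
The tokenizer-state rules above amount to a lookup on the context element's tag name. A minimal sketch follows; the type tokenizer_state, its constructors, and the scripting flag are hypothetical names chosen for illustration, and the Tokenizer sub-module may expose different ones.

```ocaml
(* Hypothetical state names; only the mapping itself comes from the list above. *)
type tokenizer_state = Data | Rcdata | Rawtext | Script_data | Plaintext

let initial_tokenizer_state ~(scripting : bool) (tag_name : string) :
    tokenizer_state =
  match tag_name with
  | "title" | "textarea" -> Rcdata
  | "style" | "xmp" | "iframe" | "noembed" | "noframes" -> Rawtext
  | "script" -> Script_data
  | "noscript" -> if scripting then Rawtext else Data
  | "plaintext" -> Plaintext
  | _ -> Data
```

This is why a fragment parsed with a "textarea" context treats <b> as literal text, while the same input with a "div" context produces an element.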

Insertion mode: The initial insertion mode depends on the context element:

  • <template>: "in template" mode
  • <html>: "before head" mode
  • <head>: "in head" mode
  • <body>, <div>, etc.: "in body" mode
  • <table>: "in table" mode
  • And so on...

val make_fragment_context : tag_name:string -> ?namespace:string option -> unit -> fragment_context

Create a fragment parsing context.

  • parameter tag_name

    Tag name of the context element. This should be the tag name of the element that would contain the fragment. Common choices:

    • "div": General-purpose (most common)
    • "body": For full body content
    • "tr": For table row content (<td> elements)
    • "ul", "ol": For list content (<li> elements)
    • "select": For <option> elements
  • parameter namespace

    Element namespace:

    • None: HTML namespace (default)
    • Some "svg": SVG namespace
    • Some "mathml": MathML namespace

Examples:

  (* Parse innerHTML of a table row - <td> works correctly *)
  let ctx = make_fragment_context ~tag_name:"tr" ()

  (* Parse innerHTML of an SVG group element *)
  let ctx = make_fragment_context ~tag_name:"g" ~namespace:(Some "svg") ()

  (* Parse innerHTML of a select element - <option> works correctly *)
  let ctx = make_fragment_context ~tag_name:"select" ()

val fragment_context_tag : fragment_context -> string

Get the tag name of a fragment context.

val fragment_context_namespace : fragment_context -> string option

Get the namespace of a fragment context (None for HTML).

type t

Result of parsing an HTML document or fragment.

This opaque type contains:

  • The DOM tree (access via root)
  • Parse errors if collection was enabled (access via errors)
  • Detected encoding for byte input (access via encoding)

Parsing Functions

val parse : ?collect_errors:bool -> ?fragment_context:fragment_context -> Bytesrw.Bytes.Reader.t -> t

Parse HTML from a byte stream reader.

This function implements the complete HTML5 parsing algorithm:

  1. Reads bytes from the provided reader
  2. Tokenizes the input into HTML tokens
  3. Constructs a DOM tree using the tree construction algorithm
  4. Returns the parsed result

The input should be valid UTF-8. For automatic encoding detection from raw bytes, use parse_bytes instead.

Parser behavior:

For full document parsing (no fragment context), the parser:

  • Creates a Document node as the root
  • Processes any DOCTYPE declaration
  • Creates <html>, <head>, and <body> elements as needed
  • Builds the full document tree

For fragment parsing (with fragment context), the parser:

  • Creates a Document Fragment as the root
  • Initializes tokenizer state based on context element
  • Initializes insertion mode based on context element
  • Does not create implicit <html>, <head>, <body>

  • parameter collect_errors

    If true, collect parse errors in the result. Default: false. Enabling error collection adds overhead.

  • parameter fragment_context

    Context for fragment parsing. If provided, the input is parsed as fragment content (like innerHTML).

val parse_bytes : ?collect_errors:bool -> ?transport_encoding:string -> ?fragment_context:fragment_context -> bytes -> t

Parse HTML bytes with automatic encoding detection.

This function wraps parse with encoding detection, implementing the WHATWG encoding sniffing algorithm:

Detection order:

  1. BOM: Check first 2-3 bytes for UTF-8, UTF-16LE, or UTF-16BE BOM
  2. Prescan: Look for <meta charset="..."> or <meta http-equiv="Content-Type" content="...charset=..."> in the first 1024 bytes
  3. Transport hint: Use transport_encoding if provided
  4. Fallback: Use UTF-8

The detected encoding is stored in the result (access via encoding).
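
The detection order is a first-match-wins chain. The sketch below shows only that priority logic, assuming each step has already produced an optional encoding label; choose_encoding and its bom and prescan arguments are hypothetical names, not functions of this module.

```ocaml
(* Hypothetical: each argument is the result of one detection step.
   The first available source wins, in the documented order. *)
let choose_encoding ?transport_encoding ~bom ~prescan () : string =
  match bom, prescan, transport_encoding with
  | Some e, _, _ -> e
  | None, Some e, _ -> e
  | None, None, Some e -> e
  | None, None, None -> "utf-8"
```

Under this order a <meta> declaration found by the prescan takes precedence over the transport hint, and the hint matters only when neither a BOM nor a declaration is present.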

Prescan details:

The prescan algorithm parses just enough of the document to find a charset declaration. It handles:

  • <meta charset="utf-8">
  • <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  • Comments and other markup are skipped
  • Parsing stops after 1024 bytes
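
To make the 1024-byte limit concrete, here is a deliberately simplified prescan. It is an assumption-laden sketch, not the library's implementation: prescan_charset is a hypothetical name, and unlike the real algorithm it does not verify that the match sits inside a <meta> tag and does not skip comments.

```ocaml
(* Simplified, hypothetical prescan: find "charset" followed by "=" within the
   first 1024 bytes and return the (lowercased) label that follows. *)
let prescan_charset (input : string) : string option =
  let limit = min 1024 (String.length input) in
  let head = String.lowercase_ascii (String.sub input 0 limit) in
  let is_space c = c = ' ' || c = '\t' || c = '\n' || c = '\r' || c = '\x0C' in
  let rec skip_spaces i =
    if i < limit && is_space head.[i] then skip_spaces (i + 1) else i
  in
  let key = "charset" in
  let klen = String.length key in
  (* Returns the index just after "charset", optional spaces, "=", spaces. *)
  let rec find i =
    if i + klen > limit then None
    else if String.sub head i klen = key then
      let j = skip_spaces (i + klen) in
      if j < limit && head.[j] = '=' then Some (skip_spaces (j + 1))
      else find (i + 1)
    else find (i + 1)
  in
  match find 0 with
  | None -> None
  | Some start ->
    (* An opening quote, if present, determines the terminator. *)
    let quoted = start < limit && (head.[start] = '"' || head.[start] = '\'') in
    let first = if quoted then start + 1 else start in
    let stop c =
      if quoted then c = head.[start]
      else is_space c || c = ';' || c = '"' || c = '\'' || c = '>' || c = '/'
    in
    let rec scan i = if i < limit && not (stop head.[i]) then scan (i + 1) else i in
    let last = scan first in
    if last > first then Some (String.sub head first (last - first)) else None
```

A declaration that starts beyond the first 1024 bytes is never seen, which is why the charset <meta> element should appear early in the document.
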
  • parameter collect_errors

    If true, collect parse errors. Default: false.

  • parameter transport_encoding

    Encoding hint from HTTP Content-Type header. For example: "utf-8", "iso-8859-1", "windows-1252".

  • parameter fragment_context

    Context for fragment parsing.

Result Accessors

val root : t -> Dom.node

Get the root node of the parsed document.

For full document parsing, returns a Document node with structure:

#document
├── !doctype (if DOCTYPE was present)
└── html
    ├── head
    │   └── ... (title, meta, link, script, style)
    └── body
        └── ... (page content)

For fragment parsing, returns a Document Fragment node containing the parsed elements directly (no implicit html/head/body).

val errors : t -> parse_error list

Get parse errors collected during parsing.

Returns an empty list if error collection was not enabled (~collect_errors:false or omitted) or if the document was well-formed.

Errors are returned in the order they were encountered.

Example:

  (* reader is a Bytesrw.Bytes.Reader.t over UTF-8 input *)
  let result = parse ~collect_errors:true reader in
  List.iter (fun e ->
    Printf.printf "Line %d, col %d: %s\n"
      (error_line e) (error_column e) (error_code e)
  ) (errors result)

val encoding : t -> Encoding.encoding option

Get the detected encoding.

Returns Some encoding when parse_bytes was used, indicating which encoding was detected or specified.

Returns None when parse was used (it expects pre-decoded UTF-8).

Querying

val query : t -> string -> Dom.node list

Query the DOM with a CSS selector.

Returns all elements matching the selector in document order.

Supported selectors:

See Selector for the complete list. Key selectors include:

  • Type: div, p, a
  • ID: #myid
  • Class: .myclass
  • Attribute: [href], [type="text"]
  • Pseudo-class: :first-child, :nth-child(2)
  • Combinators: div p (descendant), div > p (child)
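
To illustrate what matching the simpler selector kinds involves, the sketch below models an element and a handful of simple selectors with plain OCaml types. Everything here (element, simple, matches, matches_all) is hypothetical and unrelated to the actual Selector module; combinators and pseudo-classes are left out because they need tree context.

```ocaml
(* Hypothetical, minimal model of an element and of simple selectors. *)
type element = { tag : string; attrs : (string * string) list }

type simple =
  | Type of string (* div *)
  | Id of string (* #myid *)
  | Class of string (* .myclass *)
  | Has_attr of string (* [href] *)
  | Attr_eq of string * string (* [type="text"] *)

let matches (el : element) (sel : simple) : bool =
  match sel with
  | Type t -> String.lowercase_ascii t = String.lowercase_ascii el.tag
  | Id id -> List.assoc_opt "id" el.attrs = Some id
  | Class c ->
    (* Simplification: class names are split on spaces only. *)
    (match List.assoc_opt "class" el.attrs with
     | None -> false
     | Some v -> List.mem c (String.split_on_char ' ' v))
  | Has_attr name -> List.mem_assoc name el.attrs
  | Attr_eq (name, value) -> List.assoc_opt name el.attrs = Some value

(* A compound selector such as a.nav[href] matches when every part does. *)
let matches_all (el : element) (sels : simple list) : bool =
  List.for_all (matches el) sels
```

With the real API, the equivalent check is expressed as a selector string passed to query, for example "a.nav[href]".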

Serialization

val to_writer : ?pretty:bool -> ?indent_size:int -> t -> Bytesrw.Bytes.Writer.t -> unit

Serialize the DOM tree to a byte writer.

Outputs valid HTML5 that can be parsed to produce an equivalent DOM tree. The output follows the WHATWG serialization algorithm.

Serialization rules:

  • Void elements are written without end tags
  • Attributes are quoted with double quotes
  • Special characters in text/attributes are escaped
  • Comments preserve their content
  • DOCTYPE is serialized as <!DOCTYPE html>

  • parameter pretty

    If true (default), add indentation for readability.

  • parameter indent_size

    Spaces per indent level (default: 2).
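
The serialization rules for void elements, attribute quoting, and escaping can be sketched for a single element as follows. This is a simplified illustration, not the library's serializer: void_elements, escape, and serialize_element are hypothetical names, and the escaping omits the U+00A0 to &nbsp; replacement that the WHATWG algorithm also performs.

```ocaml
(* Void elements per the HTML standard: they never get an end tag. *)
let void_elements =
  [ "area"; "base"; "br"; "col"; "embed"; "hr"; "img"; "input";
    "link"; "meta"; "source"; "track"; "wbr" ]

(* Simplified escaping: & always; " only in attribute values;
   < and > only in text. *)
let escape ~(in_attribute : bool) (s : string) : string =
  let b = Buffer.create (String.length s) in
  String.iter
    (fun c ->
      match c with
      | '&' -> Buffer.add_string b "&amp;"
      | '"' when in_attribute -> Buffer.add_string b "&quot;"
      | '<' when not in_attribute -> Buffer.add_string b "&lt;"
      | '>' when not in_attribute -> Buffer.add_string b "&gt;"
      | c -> Buffer.add_char b c)
    s;
  Buffer.contents b

(* Serializes one element with a single text child (hypothetical shape). *)
let serialize_element (tag : string) (attrs : (string * string) list)
    (text : string) : string =
  let attr (k, v) = Printf.sprintf " %s=\"%s\"" k (escape ~in_attribute:true v) in
  let open_tag =
    Printf.sprintf "<%s%s>" tag (String.concat "" (List.map attr attrs))
  in
  if List.mem tag void_elements then open_tag
  else open_tag ^ escape ~in_attribute:false text ^ "</" ^ tag ^ ">"
```

Escaping on output is what lets the serialized text be parsed back into an equivalent tree.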

val to_string : ?pretty:bool -> ?indent_size:int -> t -> string

Serialize the DOM tree to a string.

Convenience wrapper around to_writer that returns a string.

  • parameter pretty

    If true (default), add indentation for readability.

  • parameter indent_size

    Spaces per indent level (default: 2).

val to_text : ?separator:string -> ?strip:bool -> t -> string

Extract text content from the DOM tree.

Returns the concatenation of all text node content in document order, with no HTML markup.

  • parameter separator

    String to insert between text nodes (default: " ")

  • parameter strip

    If true (default), trim leading/trailing whitespace
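
One plausible reading of the separator and strip parameters is sketched below. It assumes that strip trims each text node and that nodes left empty are dropped before joining; the documentation above does not spell this out, so treat both points, and the name join_text, as assumptions.

```ocaml
(* Hypothetical model of to_text over a list of text-node contents.
   Assumption: strip applies per node, and empty nodes are skipped. *)
let join_text ?(separator = " ") ?(strip = true) (nodes : string list) : string =
  let nodes = if strip then List.map String.trim nodes else nodes in
  String.concat separator (List.filter (fun s -> s <> "") nodes)
```

Under these assumptions, the text nodes "  Hello ", "\n" and "world  " join to "Hello world" with the default arguments.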

val to_test_format : t -> string

Serialize to html5lib test format.

This produces the tree representation format used by the html5lib-tests suite.

The format shows the tree structure with:

  • Indentation indicating depth (2 spaces per level)
  • Prefixes indicating node type:

    • <!DOCTYPE ...> for DOCTYPE
    • <tagname> for elements (with attributes on same line)
    • "text" for text nodes
    • <!-- comment --> for comments

Mainly useful for testing the parser against the reference test suite.
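
The layout described above can be sketched over a small hypothetical tree type. The node type and the dump functions are illustrative only and ignore attributes. Note that the html5lib-tests files themselves prefix each tree line with "| "; whether to_test_format emits that prefix is not stated here, so the sketch leaves it out.

```ocaml
(* Hypothetical tree type; not the library's Dom.node. *)
type node =
  | Doctype of string
  | Element of string * node list
  | Text of string
  | Comment of string

(* One output line per node, indented by 2 spaces per depth level. *)
let rec dump_at (depth : int) (n : node) : string list =
  let pad = String.make (2 * depth) ' ' in
  match n with
  | Doctype name -> [ pad ^ "<!DOCTYPE " ^ name ^ ">" ]
  | Text s -> [ pad ^ "\"" ^ s ^ "\"" ]
  | Comment s -> [ pad ^ "<!-- " ^ s ^ " -->" ]
  | Element (tag, children) ->
    (pad ^ "<" ^ tag ^ ">")
    :: List.concat_map (fun c -> dump_at (depth + 1) c) children

let dump (n : node) : string = String.concat "\n" (dump_at 0 n)
```

Because the output is line-oriented, a test failure can be reported as a plain text diff against the expected tree.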