Parser (html5rw.Html5rw.Parser)

How HTML5 Parsing Works

The HTML5 parsing algorithm is unusual compared to most parsers. It was reverse-engineered from browser behavior rather than designed from a formal grammar. This ensures the parser handles malformed HTML exactly like web browsers do.

The algorithm has three main phases:

1. Encoding Detection

Before parsing begins, the character encoding must be determined. The WHATWG specification defines a "sniffing" algorithm:

1. Check for a BOM (Byte Order Mark) at the start
2. Look for <meta charset="..."> in the first 1024 bytes
3. Use HTTP Content-Type header hint if available
4. Fall back to UTF-8
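
As a rough illustration of the first step, BOM detection can be sketched with the standard library alone. The type bom and the function detect_bom are hypothetical names chosen for this example; they are not part of the Encoding module.

```ocaml
(* Hypothetical sketch: recognise a byte order mark at the start of the input.
   These names are illustrative and not part of the library API. *)
type bom = Utf8 | Utf16le | Utf16be

let detect_bom (s : string) : bom option =
  let starts_with prefix =
    String.length s >= String.length prefix
    && String.sub s 0 (String.length prefix) = prefix
  in
  if starts_with "\xEF\xBB\xBF" then Some Utf8
  else if starts_with "\xFF\xFE" then Some Utf16le
  else if starts_with "\xFE\xFF" then Some Utf16be
  else None
```

When no BOM is present the function returns None, and the later steps of the list decide the encoding.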

see https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding
WHATWG: Determining the character encoding

2. Tokenization

The tokenizer converts the input stream into a sequence of tokens. It implements a state machine with over 80 states to handle:

Data (text content)
Tags (start tags, end tags, self-closing tags)
Comments
DOCTYPEs
Character references (&amp;, &lt;, &#60;)
CDATA sections (in SVG/MathML)

The tokenizer has special handling for:

Raw text elements: <script>, <style> - no markup parsing inside
Escapable raw text elements: <textarea>, <title> - limited parsing
RCDATA: Content where only character references are parsed
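
To make the state-machine idea concrete, here is a deliberately tiny sketch with three states instead of more than 80. It only recognises text and bare start and end tags (no attributes, comments, or character references), and the types token and state are assumptions made for this example rather than the Tokenizer module's types.

```ocaml
(* Hypothetical toy tokenizer: text plus bare start/end tags only. *)
type token = Text of string | Start_tag of string | End_tag of string

(* [Tag_name true] means we are reading the name of an end tag. *)
type state = Data | Tag_open | Tag_name of bool

let tokenize (s : string) : token list =
  let buf = Buffer.create 16 in
  let tokens = ref [] in
  let emit t = tokens := t :: !tokens in
  let flush_text () =
    if Buffer.length buf > 0 then begin
      emit (Text (Buffer.contents buf));
      Buffer.clear buf
    end
  in
  let state = ref Data in
  String.iter
    (fun c ->
      match !state, c with
      | Data, '<' -> flush_text (); state := Tag_open
      | Data, c -> Buffer.add_char buf c
      | Tag_open, '/' -> state := Tag_name true
      | Tag_open, c -> Buffer.add_char buf c; state := Tag_name false
      | Tag_name is_end, '>' ->
        let name = String.lowercase_ascii (Buffer.contents buf) in
        Buffer.clear buf;
        emit (if is_end then End_tag name else Start_tag name);
        state := Data
      | Tag_name _, c -> Buffer.add_char buf c)
    s;
  (* At end of input, pending text is emitted; an unfinished tag is dropped. *)
  if !state = Data then flush_text ();
  List.rev !tokens
```

Each character is dispatched on the pair (current state, character), which is the same shape the full tokenizer has, only with far more states and transitions.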

see https://html.spec.whatwg.org/multipage/parsing.html#tokenization
WHATWG: Tokenization

3. Tree Construction

The tree builder receives tokens from the tokenizer and builds the DOM tree. It uses insertion modes - a state machine that determines how each token should be processed based on the current document context.

Insertion modes include:

initial: Before the DOCTYPE
before_html: Before the <html> element
before_head: Before the <head> element
in_head: Inside <head>
in_body: Inside <body> (the most complex mode)
in_table: Inside <table> (special handling)
in_template: Inside <template>
And many more...

The tree builder maintains:

Stack of open elements: Elements that have been opened but not closed
List of active formatting elements: For handling nested formatting
The template insertion mode stack: For <template> elements
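
The state listed above could be modelled roughly as the record below. This is a sketch under assumptions: the type names, the field names, and the use of plain tag-name strings for elements are all invented for illustration and do not describe the actual Tree_builder or Insertion_mode modules.

```ocaml
(* Hypothetical model of tree builder state; not the library's types. *)
type insertion_mode =
  | Initial | Before_html | Before_head | In_head
  | In_body | In_table | In_template

type builder = {
  mode : insertion_mode;
  open_elements : string list;        (* head of the list is the current node *)
  active_formatting : string list;
  template_modes : insertion_mode list;
}

let empty =
  { mode = Initial; open_elements = []; active_formatting = [];
    template_modes = [] }

(* Push a newly opened element; it becomes the current node. *)
let push name b = { b with open_elements = name :: b.open_elements }

(* Pop the current node, if there is one. *)
let pop b =
  match b.open_elements with
  | [] -> b
  | _ :: rest -> { b with open_elements = rest }

let current_node b =
  match b.open_elements with
  | [] -> None
  | top :: _ -> Some top
```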

see https://html.spec.whatwg.org/multipage/parsing.html#tree-construction
WHATWG: Tree construction

Error Recovery

A key feature of HTML5 parsing is that it never fails. The specification defines error recovery for every possible malformed input. For example:

Missing end tags are implicitly closed
Misnested tags are handled via the "adoption agency algorithm"
Invalid characters are replaced with U+FFFD
Unexpected elements are either ignored or moved to valid positions

This ensures every HTML document produces a valid DOM tree.
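
The first recovery rule (missing end tags are implicitly closed) can be illustrated with a much simplified token rewriter. It is only a sketch: the real algorithm works with element scopes and many per-element rules, whereas this version knows a single rule for <p> and closes whatever is left open at end of input. The token type and the function balance are hypothetical.

```ocaml
(* Hypothetical, heavily simplified recovery: insert the end tags that the
   input left out. The stack holds open tag names, innermost first. *)
type token = Start of string | End of string | Text of string

let balance (tokens : token list) : token list =
  let rec go stack acc = function
    | [] ->
      (* End of input: close everything still open, innermost first. *)
      List.rev_append acc (List.map (fun n -> End n) stack)
    | Start "p" :: rest when (match stack with "p" :: _ -> true | _ -> false) ->
      (* A new <p> implicitly closes the currently open <p>. *)
      go stack (Start "p" :: End "p" :: acc) rest
    | Start n :: rest -> go (n :: stack) (Start n :: acc) rest
    | Text t :: rest -> go stack (Text t :: acc) rest
    | End n :: rest ->
      if List.mem n stack then begin
        (* Close intervening elements until the matching one is popped. *)
        let rec pop stack acc =
          match stack with
          | top :: tl when top = n -> (tl, End top :: acc)
          | top :: tl -> pop tl (End top :: acc)
          | [] -> ([], acc)
        in
        let stack, acc = pop stack acc in
        go stack acc rest
      end
      else go stack acc rest (* stray end tag: ignored *)
  in
  go [] [] tokens
```

Whatever the input, the output is a balanced token sequence, which mirrors the guarantee that parsing never fails.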

see https://html.spec.whatwg.org/multipage/parsing.html#parse-errors
WHATWG: Parse errors

The Adoption Agency Algorithm

One of the most complex parts of HTML5 parsing is handling misnested formatting elements. For example:

<p>Hello <b>world</p> <p>more</b> text</p>

Browsers don't just error out - they use the "adoption agency algorithm" to produce sensible results. This algorithm:

1. Identifies formatting elements that span across other elements
2. Reconstructs the tree to properly nest elements
3. Moves nodes between parents as needed

see https://html.spec.whatwg.org/multipage/parsing.html#adoption-agency-algorithm
WHATWG: The adoption agency algorithm

Sub-modules

module Dom = Dom

DOM types and manipulation.

module Tokenizer = Tokenizer

HTML5 tokenizer.

module Encoding = Encoding

Character encoding detection and conversion.

module Constants : sig ... end

HTML element constants and categories.

module Insertion_mode : sig ... end

Parser insertion modes.

module Tree_builder : sig ... end

Tree builder state.

Types

type parse_error

A parse error encountered during parsing.

HTML5 parsing never fails - it always produces a DOM tree. However, the WHATWG specification defines 92 specific error conditions that conformance checkers should report. These errors indicate malformed HTML that browsers will still render (with error recovery).

Error categories:

Tokenizer errors (detected during tokenization):

abrupt-closing-of-empty-comment: Empty comment closed abruptly by > (as in <!--> or <!--->)
abrupt-doctype-public-identifier: DOCTYPE public ID ended unexpectedly
eof-before-tag-name: End of file while reading a tag name
eof-in-tag: End of file inside a tag
missing-attribute-value: Attribute has = but no value
unexpected-null-character: Null byte in the input
unexpected-question-mark-instead-of-tag-name: <? found where a tag name was expected (for example, an XML processing instruction)

Tree construction errors (detected during tree building):

missing-doctype: No DOCTYPE before first element
unexpected-token-*: Token appeared in wrong context
foster-parenting: Content moved outside table due to invalid position

Enable error collection with ~collect_errors:true. Error collection has some performance overhead, so it's disabled by default.

see https://html.spec.whatwg.org/multipage/parsing.html#parse-errors
WHATWG: Complete list of parse errors

val error_code : parse_error -> string

Get the error code string.

Error codes are lowercase with hyphens, exactly matching the WHATWG specification naming. Examples:

"unexpected-null-character"
"eof-before-tag-name"
"missing-end-tag-name"
"duplicate-attribute"
"missing-doctype"

see https://html.spec.whatwg.org/multipage/parsing.html#parse-errors
WHATWG: Parse error codes

val error_line : parse_error -> int

Get the line number where the error occurred.

Line numbers are 1-indexed (first line is 1). Line breaks are detected at LF (U+000A), CR (U+000D), and CR+LF sequences.

val error_column : parse_error -> int

Get the column number where the error occurred.

Column numbers are 1-indexed (first column is 1). Columns reset to 1 after each line break. Column counting uses code points, not bytes or grapheme clusters.
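
The line and column rules can be sketched as a function from a byte offset to a 1-indexed (line, column) pair. This is an illustration only, assuming OCaml 4.14 or later for String.get_utf_8_uchar; the function position is a hypothetical name and not how the library necessarily tracks positions.

```ocaml
(* Hypothetical sketch: map a byte offset in UTF-8 text to (line, column),
   both 1-indexed, counting columns in code points. *)
let position (s : string) (offset : int) : int * int =
  let line = ref 1 and col = ref 1 in
  let i = ref 0 in
  let stop = min offset (String.length s) in
  while !i < stop do
    let c = s.[!i] in
    if c = '\n' then begin
      incr line; col := 1; incr i
    end else if c = '\r' then begin
      incr line; col := 1;
      (* Treat CR followed by LF as a single line break. *)
      if !i + 1 < String.length s && s.[!i + 1] = '\n' then i := !i + 2
      else incr i
    end else begin
      (* Advance by one code point, however many bytes it occupies. *)
      let d = String.get_utf_8_uchar s !i in
      incr col;
      i := !i + Uchar.utf_decode_length d
    end
  done;
  (!line, !col)
```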

type fragment_context

Context element for HTML fragment parsing.

When parsing HTML fragments (the content that would be assigned to an element's innerHTML), the parser needs to know what element would contain the fragment. This affects parsing in several ways:

Parser state initialization:

For <title> or <textarea>: Tokenizer starts in RCDATA state
For <style>, <xmp>, <iframe>, <noembed>, <noframes>: Tokenizer starts in RAWTEXT state
For <script>: Tokenizer starts in script data state
For <noscript>: Tokenizer starts in RAWTEXT state (if scripting enabled)
For <plaintext>: Tokenizer starts in PLAINTEXT state
Otherwise: Tokenizer starts in data state
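
The table above translates almost directly into a pattern match. The type tokenizer_state, the function initial_state, and the ~scripting flag are hypothetical names for this sketch; the library's own representation may differ.

```ocaml
(* Hypothetical sketch of choosing the tokenizer start state from the
   fragment context's tag name. *)
type tokenizer_state = Data | Rcdata | Rawtext | Script_data | Plaintext

let initial_state ?(scripting = false) (tag_name : string) : tokenizer_state =
  match String.lowercase_ascii tag_name with
  | "title" | "textarea" -> Rcdata
  | "style" | "xmp" | "iframe" | "noembed" | "noframes" -> Rawtext
  | "script" -> Script_data
  | "noscript" -> if scripting then Rawtext else Data
  | "plaintext" -> Plaintext
  | _ -> Data
```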

Insertion mode: The initial insertion mode depends on the context element:

<template>: "in template" mode
<html>: "before head" mode
<head>: "in head" mode
<body>, <div>, etc.: "in body" mode
<table>: "in table" mode
And so on...

see https://html.spec.whatwg.org/multipage/parsing.html#parsing-html-fragments
WHATWG: The fragment parsing algorithm

val make_fragment_context : 
  tag_name:string ->
  ?namespace:string option ->
  unit ->
  fragment_context

Create a fragment parsing context.

parameter tag_name
Tag name of the context element. This should be the tag name of the element that would contain the fragment. Common choices:
- "div": General-purpose (most common)
- "body": For full body content
- "tr": For table row content (<td> elements)
- "ul", "ol": For list content (<li> elements)
- "select": For <option> elements

parameter namespace
Element namespace:
- None: HTML namespace (default)
- Some "svg": SVG namespace
- Some "mathml": MathML namespace

Examples:

  (* Parse innerHTML of a table row - <td> works correctly *)
  let ctx = make_fragment_context ~tag_name:"tr" ()

  (* Parse innerHTML of an SVG group element *)
  let ctx = make_fragment_context ~tag_name:"g" ~namespace:(Some "svg") ()

  (* Parse innerHTML of a select element - <option> works correctly *)
  let ctx = make_fragment_context ~tag_name:"select" ()

see https://html.spec.whatwg.org/multipage/parsing.html#parsing-html-fragments
WHATWG: Fragment parsing algorithm

val fragment_context_tag : fragment_context -> string

Get the tag name of a fragment context.

val fragment_context_namespace : fragment_context -> string option

Get the namespace of a fragment context (None for HTML).

type t

Result of parsing an HTML document or fragment.

This opaque type contains:

The DOM tree (access via root)
Parse errors if collection was enabled (access via errors)
Detected encoding for byte input (access via encoding)

Parsing Functions

val parse : 
  ?collect_errors:bool ->
  ?fragment_context:fragment_context ->
  Bytesrw.Bytes.Reader.t ->
  t

Parse HTML from a byte stream reader.

This function implements the complete HTML5 parsing algorithm:

1. Reads bytes from the provided reader
2. Tokenizes the input into HTML tokens
3. Constructs a DOM tree using the tree construction algorithm
4. Returns the parsed result

The input should be valid UTF-8. For automatic encoding detection from raw bytes, use parse_bytes instead.

Parser behavior:

For full document parsing (no fragment context), the parser:

Creates a Document node as the root
Processes any DOCTYPE declaration
Creates <html>, <head>, and <body> elements as needed
Builds the full document tree

For fragment parsing (with fragment context), the parser:

Creates a Document Fragment as the root
Initializes tokenizer state based on context element
Initializes insertion mode based on context element
Does not create implicit <html>, <head>, <body>

parameter collect_errors
If true, collect parse errors in the result. Default: false. Enabling error collection adds overhead.

parameter fragment_context
Context for fragment parsing. If provided, the input is parsed as fragment content (like innerHTML).

see https://html.spec.whatwg.org/multipage/parsing.html
WHATWG: HTML parsing

val parse_bytes : 
  ?collect_errors:bool ->
  ?transport_encoding:string ->
  ?fragment_context:fragment_context ->
  bytes ->
  t

Parse HTML bytes with automatic encoding detection.

This function wraps parse with encoding detection, implementing the WHATWG encoding sniffing algorithm:

Detection order:

1. BOM: Check first 2-3 bytes for UTF-8, UTF-16LE, or UTF-16BE BOM
2. Prescan: Look for <meta charset="..."> or <meta http-equiv="Content-Type" content="...charset=..."> in the first 1024 bytes
3. Transport hint: Use transport_encoding if provided
4. Fallback: Use UTF-8

The detected encoding is stored in the result (access via encoding).

Prescan details:

The prescan algorithm parses just enough of the document to find a charset declaration. It handles:

<meta charset="utf-8">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
Comments and other markup are skipped
Parsing stops after 1024 bytes
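
A naive approximation of the prescan is shown below. It is a sketch, not the WHATWG algorithm: it simply searches the first 1024 bytes for "charset=" and reads the label that follows, so unlike the real prescan it does not skip comments or check that the match sits inside a <meta> tag. The function prescan_charset is a hypothetical name.

```ocaml
(* Hypothetical, naive prescan: look for charset=LABEL in the first 1024
   bytes and return the label in lowercase. *)
let prescan_charset (s : string) : string option =
  let limit = min (String.length s) 1024 in
  let head = String.lowercase_ascii (String.sub s 0 limit) in
  let key = "charset=" in
  let klen = String.length key in
  let rec find i =
    if i + klen > limit then None
    else if String.sub head i klen = key then Some (i + klen)
    else find (i + 1)
  in
  match find 0 with
  | None -> None
  | Some start ->
    (* Skip an optional opening quote around the label. *)
    let start =
      if start < limit && (head.[start] = '"' || head.[start] = '\'')
      then start + 1
      else start
    in
    let is_label_char = function
      | 'a' .. 'z' | '0' .. '9' | '-' | '_' -> true
      | _ -> false
    in
    let rec stop j =
      if j < limit && is_label_char head.[j] then stop (j + 1) else j
    in
    let j = stop start in
    if j > start then Some (String.sub head start (j - start)) else None
```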

parameter collect_errors
If true, collect parse errors. Default: false.

parameter transport_encoding
Encoding hint from HTTP Content-Type header. For example: "utf-8", "iso-8859-1", "windows-1252".

parameter fragment_context
Context for fragment parsing.

see https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding
WHATWG: Determining the character encoding

see https://html.spec.whatwg.org/multipage/parsing.html#prescan-a-byte-stream-to-determine-its-encoding
WHATWG: Prescan algorithm

Result Accessors

val root : t -> Dom.node

Get the root node of the parsed document.

For full document parsing, returns a Document node with structure:

#document
├── !doctype (if DOCTYPE was present)
└── html
    ├── head
    │   └── ... (title, meta, link, script, style)
    └── body
        └── ... (page content)

For fragment parsing, returns a Document Fragment node containing the parsed elements directly (no implicit html/head/body).

see https://html.spec.whatwg.org/multipage/dom.html#document
WHATWG: The Document object

val errors : t -> parse_error list

Get parse errors collected during parsing.

Returns an empty list if error collection was not enabled (collect_errors:false or omitted) or if the document was well-formed.

Errors are returned in the order they were encountered.

Example:

  (* [reader] is a Bytesrw.Bytes.Reader.t over the HTML input *)
  let result = parse ~collect_errors:true reader in
  List.iter (fun e ->
    Printf.printf "Line %d, col %d: %s\n"
      (error_line e) (error_column e) (error_code e)
  ) (errors result)

see https://html.spec.whatwg.org/multipage/parsing.html#parse-errors
WHATWG: Parse errors

val encoding : t -> Encoding.encoding option

Get the detected encoding.

Returns Some encoding when parse_bytes was used, indicating which encoding was detected or specified.

Returns None when parse was used (it expects pre-decoded UTF-8).

see https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding
WHATWG: Determining the character encoding

Querying

val query : t -> string -> Dom.node list

Query the DOM with a CSS selector.

Returns all elements matching the selector in document order.

Supported selectors:

See Selector for the complete list. Key selectors include:

Type: div, p, a
ID: #myid
Class: .myclass
Attribute: [href], [type="text"]
Pseudo-class: :first-child, :nth-child(2)
Combinators: div p (descendant), div > p (child)

raises Selector.Selector_error
if the selector is invalid

see https://www.w3.org/TR/selectors-4/
W3C: Selectors Level 4

Serialization

val to_writer : 
  ?pretty:bool ->
  ?indent_size:int ->
  t ->
  Bytesrw.Bytes.Writer.t ->
  unit

Serialize the DOM tree to a byte writer.

Outputs valid HTML5 that can be parsed to produce an equivalent DOM tree. The output follows the WHATWG serialization algorithm.

Serialization rules:

Void elements are written without end tags
Attributes are quoted with double quotes
Special characters in text/attributes are escaped
Comments preserve their content
DOCTYPE is serialized as <!DOCTYPE html>
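
Three of these rules (void elements, double-quoted attributes, escaping) can be sketched in a few lines. The helpers escape and serialize_element are hypothetical and work on plain strings rather than Dom nodes; the sketch also leaves out details of the WHATWG algorithm such as escaping U+00A0 as &nbsp;.

```ocaml
(* Hypothetical sketch of HTML escaping and element serialization. *)
let escape ~(attribute : bool) (s : string) : string =
  let b = Buffer.create (String.length s) in
  String.iter
    (fun c ->
      match c with
      | '&' -> Buffer.add_string b "&amp;"
      | '<' when not attribute -> Buffer.add_string b "&lt;"
      | '>' when not attribute -> Buffer.add_string b "&gt;"
      | '"' when attribute -> Buffer.add_string b "&quot;"
      | c -> Buffer.add_char b c)
    s;
  Buffer.contents b

(* Void elements have no content and are written without an end tag. *)
let void_elements =
  [ "area"; "base"; "br"; "col"; "embed"; "hr"; "img"; "input";
    "link"; "meta"; "source"; "track"; "wbr" ]

(* [inner] is the already serialized content of the element. *)
let serialize_element name attrs inner =
  let attrs =
    String.concat ""
      (List.map
         (fun (k, v) -> Printf.sprintf " %s=\"%s\"" k (escape ~attribute:true v))
         attrs)
  in
  if List.mem name void_elements then Printf.sprintf "<%s%s>" name attrs
  else Printf.sprintf "<%s%s>%s</%s>" name attrs inner name
```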

parameter pretty
If true (default), add indentation for readability.

parameter indent_size
Spaces per indent level (default: 2).

see https://html.spec.whatwg.org/multipage/parsing.html#serialising-html-fragments
WHATWG: Serialising HTML fragments

val to_string : ?pretty:bool -> ?indent_size:int -> t -> string

Serialize the DOM tree to a string.

Convenience wrapper around to_writer that returns a string.

parameter pretty
If true (default), add indentation for readability.

parameter indent_size
Spaces per indent level (default: 2).

val to_text : ?separator:string -> ?strip:bool -> t -> string

Extract text content from the DOM tree.

Returns the concatenation of all text node content in document order, with no HTML markup.

parameter separator
String to insert between text nodes (default: " ")

parameter strip
If true (default), trim leading/trailing whitespace
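
The behaviour can be pictured on a toy tree. The node type below is a hypothetical stand-in for Dom.node, and applying strip to the joined result (rather than to each text node) is an assumption of this sketch, not something the documentation specifies.

```ocaml
(* Hypothetical toy tree and text extraction in document order. *)
type node = Text of string | Element of string * node list

let rec text_nodes = function
  | Text t -> [ t ]
  | Element (_, children) -> List.concat_map text_nodes children

let to_text ?(separator = " ") ?(strip = true) (n : node) : string =
  let joined = String.concat separator (text_nodes n) in
  if strip then String.trim joined else joined
```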

val to_test_format : t -> string

Serialize to html5lib test format.

This produces the tree representation format used by the html5lib-tests suite.

The format shows the tree structure with:

Indentation indicating depth (2 spaces per level)
Prefixes indicating node type:
<!DOCTYPE ...> for DOCTYPE
<tagname> for elements (with attributes on same line)
"text" for text nodes
<!-- comment --> for comments

Mainly useful for testing the parser against the reference test suite.