Html5rw.Parser
Low-level parser access.
This module exposes the internals of the HTML5 parser for advanced use. Most users should use the top-level parse function instead.
The parser exposes:
HTML5 Parser - Low-Level API
This module provides the core HTML5 parsing functionality implementing the WHATWG HTML5 parsing specification. It handles tokenization, tree construction, error recovery, and produces a DOM tree.
For most uses, prefer the top-level Html5rw module which provides a simpler interface. This module is for advanced use cases that need access to parser internals.
The HTML5 parsing algorithm is unusual compared to most parsers. It was reverse-engineered from browser behavior rather than designed from a formal grammar. This ensures the parser handles malformed HTML exactly like web browsers do.
The algorithm has three main phases:
Before parsing begins, the character encoding must be determined. The WHATWG specification defines a "sniffing" algorithm:
1. Check for a BOM (Byte Order Mark) at the start
2. Look for <meta charset="..."> in the first 1024 bytes
3. Use HTTP Content-Type header hint if available
4. Fall back to UTF-8
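The first and last steps of the order above can be sketched with the standard library alone. This is a minimal illustration, not Html5rw's implementation: the names bom, detect_bom, and choose_encoding are hypothetical, and the meta and transport results are assumed to be supplied by the caller as plain labels.

```ocaml
(* Hypothetical helpers for illustration; not part of Html5rw. *)
type bom = Utf8 | Utf16le | Utf16be

(* Return the BOM found at the start of the raw input, if any. *)
let detect_bom (s : string) : bom option =
  let starts_with prefix =
    let n = String.length prefix in
    String.length s >= n && String.sub s 0 n = prefix
  in
  if starts_with "\xEF\xBB\xBF" then Some Utf8
  else if starts_with "\xFF\xFE" then Some Utf16le
  else if starts_with "\xFE\xFF" then Some Utf16be
  else None

(* Apply the priority order described above: BOM, then a <meta> label,
   then a transport hint, then the UTF-8 fallback. The two labels are
   assumed to have been extracted elsewhere. *)
let choose_encoding ?meta_charset ?transport_hint (s : string) : string =
  match detect_bom s, meta_charset, transport_hint with
  | Some Utf8, _, _ -> "utf-8"
  | Some Utf16le, _, _ -> "utf-16le"
  | Some Utf16be, _, _ -> "utf-16be"
  | None, Some label, _ -> label
  | None, None, Some label -> label
  | None, None, None -> "utf-8"
```

A BOM always wins in this sketch, so a conflicting hint such as ~transport_hint:"windows-1252" is ignored when the input starts with one.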
The tokenizer converts the input stream into a sequence of tokens. It implements a state machine with over 80 states to handle:
&, <, <)
The tokenizer has special handling for:
<script>, <style> - no markup parsing inside
<textarea>, <title> - limited parsing
The tree builder receives tokens from the tokenizer and builds the DOM tree. It uses insertion modes - a state machine that determines how each token should be processed based on the current document context.
Insertion modes include:
initial: Before the DOCTYPE
before_html: Before the <html> element
before_head: Before the <head> element
in_head: Inside <head>
in_body: Inside <body> (the most complex mode)
in_table: Inside <table> (special handling)
in_template: Inside <template>
The tree builder maintains:
<template> elements
A key feature of HTML5 parsing is that it never fails. The specification defines error recovery for every possible malformed input. For example:
This ensures every HTML document produces a valid DOM tree.
One of the most complex parts of HTML5 parsing is handling misnested formatting elements. For example:
<p>Hello <b>world</p> <p>more</b> text</p>
Browsers don't just error out - they use the "adoption agency algorithm" to produce sensible results. This algorithm:
1. Identifies formatting elements that span across other elements
2. Reconstructs the tree to properly nest elements
3. Moves nodes between parents as needed
module Dom = Dom
DOM types and manipulation.
module Tokenizer = Tokenizer
HTML5 tokenizer.
module Encoding = Encoding
Character encoding detection and conversion.
module Constants : sig ... end
HTML element constants and categories.
module Insertion_mode : sig ... end
Parser insertion modes.
module Tree_builder : sig ... end
Tree builder state.
A parse error encountered during parsing.
HTML5 parsing never fails - it always produces a DOM tree. However, the WHATWG specification defines 92 specific error conditions that conformance checkers should report. These errors indicate malformed HTML that browsers will still render (with error recovery).
Error categories:
Tokenizer errors (detected during tokenization):
abrupt-closing-of-empty-comment: Comment closed with --> without content
abrupt-doctype-public-identifier: DOCTYPE public ID ended unexpectedly
eof-before-tag-name: End of file while reading a tag name
eof-in-tag: End of file inside a tag
missing-attribute-value: Attribute has = but no value
unexpected-null-character: Null byte in the input
unexpected-question-mark-instead-of-tag-name: <? used instead of <!
Tree construction errors (detected during tree building):
missing-doctype: No DOCTYPE before first element
unexpected-token-*: Token appeared in wrong context
foster-parenting: Content moved outside table due to invalid position
Enable error collection with ~collect_errors:true. Error collection has some performance overhead, so it's disabled by default.
val error_code : parse_error -> string
Get the error code string.
Error codes are lowercase with hyphens, exactly matching the WHATWG specification naming. Examples:
"unexpected-null-character"
"eof-before-tag-name"
"missing-end-tag-name"
"duplicate-attribute"
"missing-doctype"
val error_line : parse_error -> int
Get the line number where the error occurred.
Line numbers are 1-indexed (first line is 1). Line breaks are detected at LF (U+000A), CR (U+000D), and CR+LF sequences.
val error_column : parse_error -> int
Get the column number where the error occurred.
Column numbers are 1-indexed (first column is 1). Columns reset to 1 after each line break. Column counting uses code points, not bytes or grapheme clusters.
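The line and column rules above can be made concrete with a small standard-library sketch. It is an illustration only, not Html5rw's code: position_of_offset is a hypothetical helper that maps a byte offset in a UTF-8 string to a 1-indexed (line, column) pair, and it assumes the input is well-formed UTF-8.

```ocaml
(* Hypothetical helper for illustration; not part of Html5rw.
   Returns the 1-indexed (line, column) of the byte at [offset],
   assuming [s] is well-formed UTF-8. *)
let position_of_offset (s : string) (offset : int) : int * int =
  let line = ref 1 and col = ref 1 in
  let i = ref 0 in
  let limit = min offset (String.length s) in
  while !i < limit do
    let c = s.[!i] in
    if c = '\n' then begin
      (* A lone LF is a line break. *)
      incr line; col := 1; incr i
    end else if c = '\r' then begin
      (* CR is a line break; a following LF belongs to the same break. *)
      incr line; col := 1;
      if !i + 1 < String.length s && s.[!i + 1] = '\n' then i := !i + 2
      else incr i
    end else begin
      (* Count code points, not bytes: UTF-8 continuation bytes have the
         form 10xxxxxx and do not start a new column. *)
      if Char.code c land 0xC0 <> 0x80 then incr col;
      incr i
    end
  done;
  (!line, !col)
```

Because continuation bytes are skipped, a two-byte character such as U+00E9 advances the column by one, which matches counting in code points rather than bytes.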
Context element for HTML fragment parsing.
When parsing HTML fragments (the content that would be assigned to an element's innerHTML), the parser needs to know what element would contain the fragment. This affects parsing in several ways:
Parser state initialization:
<title> or <textarea>: Tokenizer starts in RCDATA state
<style>, <xmp>, <iframe>, <noembed>, <noframes>: Tokenizer starts in RAWTEXT state
<script>: Tokenizer starts in script data state
<noscript>: Tokenizer starts in RAWTEXT state (if scripting enabled)
<plaintext>: Tokenizer starts in PLAINTEXT state
Insertion mode: The initial insertion mode depends on the context element:
<template>: "in template" mode
<html>: "before head" mode
<head>: "in head" mode
<body>, <div>, etc.: "in body" mode
<table>: "in table" mode
val make_fragment_context :
tag_name:string ->
?namespace:string option ->
unit ->
fragment_context
Create a fragment parsing context.
Examples:
(* Parse innerHTML of a table row - <td> works correctly *)
let ctx = make_fragment_context ~tag_name:"tr" ()
(* Parse innerHTML of an SVG group element *)
let ctx = make_fragment_context ~tag_name:"g" ~namespace:(Some "svg") ()
(* Parse innerHTML of a select element - <option> works correctly *)
let ctx = make_fragment_context ~tag_name:"select" ()
val fragment_context_tag : fragment_context -> string
Get the tag name of a fragment context.
val fragment_context_namespace : fragment_context -> string option
Get the namespace of a fragment context (None for HTML).
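The tokenizer-state rules listed under "Parser state initialization" can be written as a single lookup. This is a sketch under stated assumptions: the initial_state type and initial_tokenizer_state function are hypothetical names, not Html5rw's own, and the mapping applies only to context elements in the HTML namespace.

```ocaml
(* Hypothetical type and function for illustration; Html5rw's own
   tokenizer state representation may differ. *)
type initial_state = Data | Rcdata | Rawtext | Script_data | Plaintext

(* Map an HTML-namespace context element name to the tokenizer's
   starting state, following the list in the text above. *)
let initial_tokenizer_state ?(scripting = true) (tag_name : string)
    : initial_state =
  match String.lowercase_ascii tag_name with
  | "title" | "textarea" -> Rcdata
  | "style" | "xmp" | "iframe" | "noembed" | "noframes" -> Rawtext
  | "script" -> Script_data
  | "noscript" -> if scripting then Rawtext else Data
  | "plaintext" -> Plaintext
  | _ -> Data (* every other context element starts in the data state *)
```

For example, a fragment parsed in a <textarea> context starts in RCDATA, so a "<b>" inside it is treated as text rather than as a tag.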
val parse :
?collect_errors:bool ->
?fragment_context:fragment_context ->
Bytesrw.Bytes.Reader.t ->
t
Parse HTML from a byte stream reader.
This function implements the complete HTML5 parsing algorithm:
1. Reads bytes from the provided reader
2. Tokenizes the input into HTML tokens
3. Constructs a DOM tree using the tree construction algorithm
4. Returns the parsed result
The input should be valid UTF-8. For automatic encoding detection from raw bytes, use parse_bytes instead.
Parser behavior:
For full document parsing (no fragment context), the parser:
<html>, <head>, and <body> elements as needed
For fragment parsing (with fragment context), the parser:
<html>, <head>, <body>
val parse_bytes :
?collect_errors:bool ->
?transport_encoding:string ->
?fragment_context:fragment_context ->
bytes ->
t
Parse HTML bytes with automatic encoding detection.
This function wraps parse with encoding detection, implementing the WHATWG encoding sniffing algorithm:
Detection order:
1. BOM: Check first 2-3 bytes for UTF-8, UTF-16LE, or UTF-16BE BOM
2. Prescan: Look for <meta charset="..."> or <meta http-equiv="Content-Type" content="...charset=..."> in the first 1024 bytes
3. Transport hint: Use transport_encoding if provided
4. Fallback: Use UTF-8
The detected encoding is stored in the result (access via encoding).
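To give a feel for the prescan step in the detection order above, here is a deliberately naive standard-library sketch. It is not the WHATWG prescan algorithm and not Html5rw's implementation: the real algorithm tokenizes <meta> tags and skips comments, whereas the hypothetical prescan_charset below only searches the first 1024 bytes for the text "charset=".

```ocaml
(* Hypothetical, simplified helper for illustration; not part of Html5rw.
   Looks for "charset=" in the first 1024 bytes and returns the
   lowercased label that follows it, if any. *)
let prescan_charset (s : string) : string option =
  let window =
    String.lowercase_ascii (String.sub s 0 (min 1024 (String.length s)))
  in
  let n = String.length window in
  let key = "charset=" in
  let klen = String.length key in
  (* Index just past the first occurrence of [key], if present. *)
  let rec find i =
    if i + klen > n then None
    else if String.sub window i klen = key then Some (i + klen)
    else find (i + 1)
  in
  match find 0 with
  | None -> None
  | Some start ->
    (* Skip one optional opening quote. *)
    let start =
      if start < n && (window.[start] = '"' || window.[start] = '\'')
      then start + 1
      else start
    in
    let is_label_char = function
      | 'a' .. 'z' | '0' .. '9' | '-' | '_' | ':' | '.' -> true
      | _ -> false
    in
    let stop = ref start in
    while !stop < n && is_label_char window.[!stop] do incr stop done;
    if !stop > start then Some (String.sub window start (!stop - start))
    else None
```

Both declaration forms are found by the same search, because the http-equiv form also contains "charset=" inside its content attribute.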
Prescan details:
The prescan algorithm parses just enough of the document to find a charset declaration. It handles:
<meta charset="utf-8">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
Get the root node of the parsed document.
For full document parsing, returns a Document node with structure:
#document
├── !doctype (if DOCTYPE was present)
└── html
    ├── head
    │   └── ... (title, meta, link, script, style)
    └── body
        └── ... (page content)
For fragment parsing, returns a Document Fragment node containing the parsed elements directly (no implicit html/head/body).
val errors : t -> parse_error list
Get parse errors collected during parsing.
Returns an empty list if error collection was not enabled (collect_errors:false or omitted) or if the document was well-formed.
Errors are returned in the order they were encountered.
Example:
let result = parse ~collect_errors:true reader in
List.iter (fun e ->
Printf.printf "Line %d, col %d: %s\n"
(error_line e) (error_column e) (error_code e)
) (errors result)
val encoding : t -> Encoding.encoding option
Get the detected encoding.
Returns Some encoding when parse_bytes was used, indicating which encoding was detected or specified.
Returns None when parse was used (it expects pre-decoded UTF-8).
Query the DOM with a CSS selector.
Returns all elements matching the selector in document order.
Supported selectors:
See Selector for the complete list. Key selectors include:
div, p, a
#myid
.myclass
[href], [type="text"]
:first-child, :nth-child(2)
div p (descendant), div > p (child)
val to_writer :
?pretty:bool ->
?indent_size:int ->
t ->
Bytesrw.Bytes.Writer.t ->
unit
Serialize the DOM tree to a byte writer.
Outputs valid HTML5 that can be parsed to produce an equivalent DOM tree. The output follows the WHATWG serialization algorithm.
Serialization rules:
<!DOCTYPE html>
val to_string : ?pretty:bool -> ?indent_size:int -> t -> string
Serialize the DOM tree to a string.
Convenience wrapper around to_writer that returns a string.
val to_text : ?separator:string -> ?strip:bool -> t -> string
Extract text content from the DOM tree.
Returns the concatenation of all text node content in document order, with no HTML markup.
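The idea of collecting text nodes in document order can be sketched on a toy tree. Everything here is an assumption made for illustration: the node type and text_content function are hypothetical and unrelated to Html5rw's Dom module, and the defaults and exact meaning chosen for separator and strip are one plausible reading, since the text above does not specify them.

```ocaml
(* Hypothetical toy tree for illustration; not Html5rw's Dom types. *)
type node =
  | Text of string
  | Comment of string
  | Element of string * node list

(* Accumulate text node contents in reverse document order. *)
let rec collect (acc : string list) (n : node) : string list =
  match n with
  | Text s -> s :: acc
  | Comment _ -> acc
  | Element (_, children) -> List.fold_left collect acc children

(* Assumed semantics: [strip] trims each text node, empty pieces are
   dropped, and the remaining pieces are joined with [separator]. *)
let text_content ?(separator = " ") ?(strip = true) (root : node) : string =
  collect [] root
  |> List.rev
  |> List.map (fun s -> if strip then String.trim s else s)
  |> List.filter (fun s -> s <> "")
  |> String.concat separator
```

Comments contribute nothing to the result, and element names never appear in it, which is the "no HTML markup" property described above.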
val to_test_format : t -> string
Serialize to html5lib test format.
This produces the tree representation format used by the html5lib-tests suite.
The format shows the tree structure with:
<!DOCTYPE ...> for DOCTYPE
<tagname> for elements (with attributes on same line)
"text" for text nodes
<!-- comment --> for comments
Mainly useful for testing the parser against the reference test suite.