Html5rw - Pure OCaml HTML5 Parser
This library provides a complete HTML5 parsing solution that implements the WHATWG HTML5 parsing specification. It can parse any HTML document - well-formed or not - and produce a DOM (Document Object Model) tree that matches browser behavior.
HTML (HyperText Markup Language) is the standard markup language for creating web pages. An HTML document consists of nested elements that describe the structure and content of the page:
<!DOCTYPE html>
<html>
  <head>
    <title>My Page</title>
  </head>
  <body>
    <h1>Welcome</h1>
    <p>Hello, <b>world</b>!</p>
  </body>
</html>

Each element is written with a start tag (like <p>), content, and an end tag (like </p>). Elements can have attributes that provide additional information: <a href="https://example.com">.
When this parser processes HTML, it doesn't just store the text. Instead, it builds a tree structure called the DOM (Document Object Model). Each element, text fragment, and comment becomes a node in this tree:
Document
└── html
    ├── head
    │   └── title
    │       └── #text "My Page"
    └── body
        ├── h1
        │   └── #text "Welcome"
        └── p
            ├── #text "Hello, "
            ├── b
            │   └── #text "world"
            └── #text "!"

This tree can be traversed, searched, and modified. The Dom module provides types and functions for working with DOM nodes.
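As a rough illustration of the tree above, the following sketch models the <body> subtree with a small OCaml variant and searches it by tag name. The node type and find_by_tag are hypothetical and exist only for this example; the library's real node type lives in Dom and is richer.

```ocaml
(* Hypothetical, simplified node type for illustration only;
   it is not the library's Dom.node. *)
type node =
  | Element of string * node list  (* tag name, children *)
  | Text of string

(* The <body> subtree from the diagram above. *)
let body =
  Element ("body", [
    Element ("h1", [Text "Welcome"]);
    Element ("p", [Text "Hello, "; Element ("b", [Text "world"]); Text "!"]);
  ])

(* Collect every element with the given tag name, depth-first,
   in document order. *)
let rec find_by_tag tag node =
  match node with
  | Text _ -> []
  | Element (name, children) ->
    let below = List.concat_map (find_by_tag tag) children in
    if name = tag then node :: below else below

let () =
  Printf.printf "%d <b> element(s)\n" (List.length (find_by_tag "b" body))
```

In real code the same kind of search would normally go through the library's selector support rather than a hand-written traversal.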
Parse HTML from a string:
open Bytesrw
let reader = Bytes.Reader.of_string "<p>Hello, world!</p>"
let result = Html5rw.parse reader
let html = Html5rw.to_string result

Parse from a file:
open Bytesrw
let result =
  (* Open in binary mode so the bytes reach the parser unchanged. *)
  let ic = open_in_bin "page.html" in
  let reader = Bytes.Reader.of_in_channel ic in
  let result = Html5rw.parse reader in
  close_in ic;
  result

Query with CSS selectors:
let result = Html5rw.parse reader
let divs = Html5rw.query result "div.content"

Unlike parsing in many other formats, HTML5 parsing never fails. The WHATWG specification defines error recovery rules for every possible malformed input, so every HTML document produces a valid DOM tree (just as in browsers).
For example, parsing <p>Hello<p>World produces two paragraphs, not an error, because <p> implicitly closes the previous <p>.
If you need to detect malformed HTML (e.g., for validation), enable error collection with ~collect_errors:true. Errors are advisory - the parsing still succeeds.
This parser implements HTML5 parsing, not XHTML parsing. Key differences:
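- Tag and attribute names are case-insensitive (<DIV> equals <div>)
- End tags may be omitted for some elements (<p>Hello is valid)
- Void elements have no end tag (<br>, not <br/> or <br></br>)
- Boolean attributes need no value (<input disabled>)
XHTML uses stricter XML rules. If you need XHTML parsing, use an XML parser.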
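Two of the HTML5 rules mentioned above, case-insensitive tag names and void elements such as <br>, can be sketched with the standard library alone. The helper names same_tag and is_void are made up for this example and are not part of Html5rw; the void-element list follows the WHATWG HTML specification.

```ocaml
(* Void elements per the WHATWG HTML specification: they never have
   an end tag or content. *)
let void_elements =
  [ "area"; "base"; "br"; "col"; "embed"; "hr"; "img"; "input";
    "link"; "meta"; "source"; "track"; "wbr" ]

(* HTML tag names are ASCII case-insensitive, so compare lowercased. *)
let same_tag a b =
  String.lowercase_ascii a = String.lowercase_ascii b

(* Hypothetical helper: is this tag a void element? *)
let is_void tag = List.mem (String.lowercase_ascii tag) void_elements

let () =
  (* prints "true true false" *)
  Printf.printf "%b %b %b\n"
    (same_tag "DIV" "div") (is_void "BR") (is_void "p")
```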
module Dom : sig ... end
DOM types and manipulation functions.
module Tokenizer : sig ... end
HTML5 tokenizer.
module Encoding : sig ... end
Encoding detection and decoding.
module Selector : sig ... end
CSS selector engine.
module Entities : sig ... end
HTML entity decoding.
module Parser : sig ... end
Low-level parser access.
type node = Dom.node
DOM node type.
A node represents one part of an HTML document. Nodes form a tree structure with parent/child relationships. There are several kinds:
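- Element: an HTML tag such as <div>, <p>, <a>
- Text: character content within elements
- Comment: an HTML comment, <!-- ... -->
- Document: the root of a complete document tree
- Document fragment: a lightweight container for nodes
- Doctype: the <!DOCTYPE html> declaration
See Dom for manipulation functions.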
type doctype_data = Dom.doctype_data = {
  name : string option;  (* DOCTYPE name, typically "html" *)
  public_id : string option;  (* Public identifier for legacy DOCTYPEs (e.g., XHTML, HTML4) *)
  system_id : string option;  (* System identifier (URL) for legacy DOCTYPEs *)
}
DOCTYPE information.
The DOCTYPE declaration (<!DOCTYPE html>) appears at the start of HTML documents. It tells browsers to use standards mode for rendering.
In HTML5, the DOCTYPE is minimal - just <!DOCTYPE html> with no public or system identifiers. Legacy DOCTYPEs may have additional fields.
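To make the difference between the minimal HTML5 DOCTYPE and a legacy one concrete, here is a small sketch. It redeclares a record with the same three fields so that it compiles on its own; is_minimal_html5 is a hypothetical helper, not a library function, and the legacy value uses the HTML 4.01 Strict identifiers.

```ocaml
(* Local copy of the record shape documented above, so the sketch
   compiles on its own; real code would use Html5rw.doctype_data. *)
type doctype_data = {
  name : string option;
  public_id : string option;
  system_id : string option;
}

(* Hypothetical helper: true only for the minimal <!DOCTYPE html>. *)
let is_minimal_html5 d =
  match d with
  | { name = Some n; public_id = None; system_id = None } ->
    String.lowercase_ascii n = "html"
  | _ -> false

(* <!DOCTYPE html> *)
let html5 = { name = Some "html"; public_id = None; system_id = None }

(* HTML 4.01 Strict, a legacy DOCTYPE with both identifiers. *)
let html4_strict = {
  name = Some "html";
  public_id = Some "-//W3C//DTD HTML 4.01//EN";
  system_id = Some "http://www.w3.org/TR/html4/strict.dtd";
}

let () =
  (* prints "true false" *)
  Printf.printf "%b %b\n"
    (is_minimal_html5 html5) (is_minimal_html5 html4_strict)
```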
Quirks mode as determined during parsing.
Quirks mode controls how browsers render CSS and compute layouts. It exists for backwards compatibility with old web pages that relied on browser bugs.
Standards (no-quirks) mode is selected by <!DOCTYPE html>.
Recommendation: Always use <!DOCTYPE html> to ensure standards mode.
type encoding = Encoding.encoding =
  | Utf8  (* UTF-8: The dominant encoding for the web, supporting all Unicode *)
  | Utf16le  (* UTF-16 Little-Endian: 16-bit encoding, used by Windows *)
  | Utf16be  (* UTF-16 Big-Endian: 16-bit encoding, network byte order *)
  | Windows_1252  (* Windows-1252 (CP-1252): Western European, superset of ISO-8859-1 *)
  | Iso_8859_2  (* ISO-8859-2: Central European (Polish, Czech, Hungarian, etc.) *)
  | Euc_jp  (* EUC-JP: Extended Unix Code for Japanese *)
Character encoding detected or specified.
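HTML documents are sequences of bytes that must be decoded into characters. Different encodings interpret the same bytes differently. For example, the byte 0xE9 is "é" in Windows-1252, but in UTF-8 it is only the first byte of a three-byte sequence.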
The parser detects encoding automatically when using parse_bytes. The detected encoding is available via encoding.
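The following sketch shows why the encoding matters: the same input byte has to be re-expressed as a different byte sequence once it is decoded. It handles only the part of Windows-1252 that coincides with Unicode code points (ASCII and 0xA0-0xFF); the function name is hypothetical and the library's own decoder in Encoding is not shown here.

```ocaml
(* Decode one byte as Windows-1252 and re-encode it as UTF-8.
   Simplification: only ASCII and 0xA0-0xFF are handled, where
   Windows-1252 coincides with Unicode code points; 0x80-0x9F would
   need a lookup table that is omitted here. *)
let utf8_of_windows_1252_byte (c : char) : string option =
  let code = Char.code c in
  if code < 0x80 || code >= 0xA0 then begin
    let buf = Buffer.create 4 in
    Buffer.add_utf_8_uchar buf (Uchar.of_int code);
    Some (Buffer.contents buf)
  end else None

let () =
  match utf8_of_windows_1252_byte '\xE9' with
  | Some s -> Printf.printf "0xE9 -> %d UTF-8 byte(s)\n" (String.length s)
  | None -> print_endline "unsupported byte"
```

A single Windows-1252 byte for "é" becomes the two bytes 0xC3 0xA9 in UTF-8, so treating one encoding as the other corrupts the text.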
type parse_error = Parser.parse_error
A parse error encountered during HTML5 parsing.
HTML5 parsing never fails - the specification defines error recovery for all malformed input. However, conformance checkers can report these errors. Enable error collection with ~collect_errors:true if you want to detect malformed HTML.
Common parse errors:
- "unexpected-null-character": Null byte in the input
- "eof-before-tag-name": File ended while reading a tag
- "unexpected-character-in-attribute-name": Invalid attribute syntax
- "missing-doctype": Document started without <!DOCTYPE>
- "duplicate-attribute": Same attribute appears twice on an element
The full list of parse error codes is defined in the WHATWG specification.
val error_code : parse_error -> string
Get the error code string.
Error codes are lowercase with hyphens, matching the WHATWG specification names. Examples: "unexpected-null-character", "eof-in-tag", "missing-end-tag-name".
val error_line : parse_error -> int
Get the line number where the error occurred (1-indexed).
Line numbers count from 1 and increment at each newline character.
val error_column : parse_error -> int
Get the column number where the error occurred (1-indexed).
Column numbers count from 1 and reset at each newline.
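The 1-indexed line and column convention can be sketched as a small function that walks the input up to a byte offset. This is an illustration only: position_of_offset is a hypothetical helper, and it counts columns in bytes, which is an assumption (the library may count characters).

```ocaml
(* Compute the 1-indexed (line, column) of byte offset [pos] in [s].
   Lines increment at each '\n' and the column resets to 1 after it.
   Columns are counted in bytes here, which is an assumption. *)
let position_of_offset (s : string) (pos : int) : int * int =
  let line = ref 1 and col = ref 1 in
  for i = 0 to min pos (String.length s) - 1 do
    if s.[i] = '\n' then begin
      incr line;
      col := 1
    end else incr col
  done;
  (!line, !col)

let () =
  let line, col = position_of_offset "<p>\n<b" 4 in
  (* prints "line 2, column 1" *)
  Printf.printf "line %d, column %d\n" line col
```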
type fragment_context = Parser.fragment_context
Context element for HTML fragment parsing (innerHTML).
When parsing HTML fragments (like the innerHTML of an element), you must specify what element would contain the fragment. This affects how the parser handles certain elements.
Why context matters:
HTML parsing rules depend on where content appears. For example:
- <td> is valid inside <tr> but not inside <div>
- <li> is valid inside <ul> but creates implied lists elsewhere
- Content inside <table> has special parsing rules
Example:
(* Parse as if content were inside a <ul> *)
let ctx = make_fragment_context ~tag_name:"ul" ()
let result = parse ~fragment_context:ctx reader
(* Now <li> elements are parsed correctly *)

val make_fragment_context :
  tag_name:string ->
  ?namespace:string option ->
  unit ->
  fragment_context
Create a fragment parsing context.
The context element determines how the parser interprets the fragment. Choose a context that matches where the fragment would be inserted.
Examples:
(* Parse as innerHTML of a <div> (most common case) *)
let ctx = make_fragment_context ~tag_name:"div" ()
(* Parse as innerHTML of a <ul> - <li> elements work correctly *)
let ctx = make_fragment_context ~tag_name:"ul" ()
(* Parse as innerHTML of an SVG <g> element *)
let ctx = make_fragment_context ~tag_name:"g" ~namespace:(Some "svg") ()
(* Parse as innerHTML of a <table> - table-specific rules apply *)
let ctx = make_fragment_context ~tag_name:"table" ()

val fragment_context_tag : fragment_context -> string
Get the tag name of a fragment context.
val fragment_context_namespace : fragment_context -> string option
Get the namespace of a fragment context.
type t = {
  root : node;
  (* Root node of the parsed document tree.
     For full document parsing, this is a Document node containing the DOCTYPE (if any) and <html> element.
     For fragment parsing, this is a Document Fragment containing the parsed elements. *)
  errors : parse_error list;
  (* Parse errors encountered during parsing.
     This list is empty unless ~collect_errors:true was passed to the parse function. Errors are in the order they were encountered. *)
  encoding : encoding option;
  (* Character encoding detected during parsing.
     This is Some encoding when using parse_bytes with automatic encoding detection, and None when using parse (which expects pre-decoded UTF-8 input). *)
}

val parse :
  ?collect_errors:bool ->
  ?fragment_context:fragment_context ->
  Bytesrw.Bytes.Reader.t ->
  t
Parse HTML from a Bytes.Reader.t.
This is the primary parsing function. It reads bytes from the provided reader and returns a DOM tree. The input should be valid UTF-8.
Creating readers:
open Bytesrw
(* From a string *)
let reader = Bytes.Reader.of_string html_string
(* From a file *)
let ic = open_in_bin "page.html"
let reader = Bytes.Reader.of_in_channel ic
(* From the contents of a Buffer.t *)
let reader = Bytes.Reader.of_string (Buffer.contents buf)

Parsing a complete document:
let result = Html5rw.parse reader
let doc = Html5rw.root result

Parsing a fragment:
let ctx = Html5rw.make_fragment_context ~tag_name:"div" ()
let result = Html5rw.parse ~fragment_context:ctx reader

val parse_bytes :
  ?collect_errors:bool ->
  ?transport_encoding:string ->
  ?fragment_context:fragment_context ->
  bytes ->
  t
Parse raw bytes with automatic encoding detection.
This function is useful when you have raw bytes and don't know the character encoding. It implements the WHATWG encoding sniffing algorithm:
1. BOM detection: Check for UTF-8, UTF-16LE, or UTF-16BE BOM
2. Prescan: Look for <meta charset="..."> in the first 1024 bytes
3. Transport hint: Use the provided transport_encoding if any
4. Fallback: Use UTF-8 (the modern web default)
The detected encoding is stored in the result's encoding field.
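Step 1 of the list above (BOM detection) is simple enough to sketch with the standard library. The bom type and detect_bom function are hypothetical and written for this example only; they are not the library's Encoding API, and the later sniffing steps are not covered.

```ocaml
(* Hypothetical result type for this sketch. *)
type bom = Utf8_bom | Utf16le_bom | Utf16be_bom

(* Return the encoding signalled by a byte order mark at the start of
   [s], if any. The BOM byte sequences are EF BB BF (UTF-8),
   FF FE (UTF-16LE) and FE FF (UTF-16BE). *)
let detect_bom (s : string) : bom option =
  let has_prefix p =
    String.length s >= String.length p
    && String.sub s 0 (String.length p) = p
  in
  if has_prefix "\xEF\xBB\xBF" then Some Utf8_bom
  else if has_prefix "\xFF\xFE" then Some Utf16le_bom
  else if has_prefix "\xFE\xFF" then Some Utf16be_bom
  else None

let () =
  match detect_bom "\xEF\xBB\xBF<p>hi</p>" with
  | Some Utf8_bom -> print_endline "UTF-8 BOM"
  | Some Utf16le_bom -> print_endline "UTF-16LE BOM"
  | Some Utf16be_bom -> print_endline "UTF-16BE BOM"
  | None -> print_endline "no BOM"
```

When no BOM is present, detection would continue with the remaining steps of the list.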
Example:
let () =
  let ic = open_in_bin "page.html" in
  (* Read the whole file into a bytes value. *)
  let bytes = Bytes.of_string (really_input_string ic (in_channel_length ic)) in
  close_in ic;
  let result = Html5rw.parse_bytes bytes in
  match Html5rw.encoding result with
  | Some Html5rw.Utf8 -> print_endline "UTF-8 detected"
  | Some Html5rw.Windows_1252 -> print_endline "Windows-1252 detected"
  | _ -> ()

Query the DOM tree with a CSS selector.
CSS selectors are patterns used to select elements in HTML documents. This function returns all nodes matching the selector, in document order.
Supported selectors:
Type selectors:
- div, p, span - elements by tag name
Class and ID selectors:
- #myid - element with id="myid"
- .myclass - elements with class containing "myclass"
Attribute selectors:
- [attr] - elements with the attr attribute
- [attr="value"] - attribute equals value
- [attr~="value"] - attribute contains word
- [attr|="value"] - attribute equals value or starts with value-
- [attr^="value"] - attribute starts with value
- [attr$="value"] - attribute ends with value
- [attr*="value"] - attribute contains value
Pseudo-classes:
- :first-child, :last-child - first/last child of parent
- :nth-child(n) - nth child (1-indexed)
- :only-child - only child of parent
- :empty - elements with no children
- :not(selector) - elements not matching selector
Combinators:
- A B - B descendants of A (any depth)
- A > B - B direct children of A
- A + B - B immediately after A (adjacent sibling)
- A ~ B - B after A (general sibling)
Universal:
- * - all elements
Examples:
(* All paragraphs *)
let ps = query result "p"
(* Elements with class "warning" inside a div *)
let warnings = query result "div .warning"
(* Direct children of nav that are links *)
let nav_links = query result "nav > a"
(* Complex selector *)
let items = query result "ul.menu > li:first-child a[href]"

val matches : node -> string -> bool
Check if a node matches a CSS selector.
This is useful for filtering nodes or implementing custom traversals.
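Deciding whether a node matches comes down to small per-part checks. As one illustration, the attribute operators listed under "Attribute selectors" above can be sketched as plain string tests. This is not the library's selector engine: the op type and attr_matches are hypothetical, and words are split on spaces only, which is a simplification.

```ocaml
(* One constructor per attribute operator: = ~= |= ^= $= *= *)
type op = Equals | Word | Dash_prefix | Prefix | Suffix | Substring

let has_prefix ~prefix s =
  String.length s >= String.length prefix
  && String.sub s 0 (String.length prefix) = prefix

let has_suffix ~suffix s =
  let ls = String.length s and lx = String.length suffix in
  ls >= lx && String.sub s (ls - lx) lx = suffix

let contains ~sub s =
  let ls = String.length s and lx = String.length sub in
  let rec go i = i + lx <= ls && (String.sub s i lx = sub || go (i + 1)) in
  go 0

(* [attr_matches op expected actual]: does the attribute value [actual]
   satisfy the operator with the selector value [expected]?
   An empty [expected] never matches for ~=, ^=, $= and *=. *)
let attr_matches op expected actual =
  match op with
  | Equals -> actual = expected
  | Word ->
    expected <> "" && List.mem expected (String.split_on_char ' ' actual)
  | Dash_prefix ->
    actual = expected || has_prefix ~prefix:(expected ^ "-") actual
  | Prefix -> expected <> "" && has_prefix ~prefix:expected actual
  | Suffix -> expected <> "" && has_suffix ~suffix:expected actual
  | Substring -> expected <> "" && contains ~sub:expected actual

let () =
  (* [class~="warning"] against class="note warning" *)
  Printf.printf "%b\n" (attr_matches Word "warning" "note warning")
```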
Example:
let is_external_link node =
  matches node "a[href^='http']"

val to_writer :
  ?pretty:bool ->
  ?indent_size:int ->
  t ->
  Bytesrw.Bytes.Writer.t ->
  unit
Write the DOM tree to a Bytes.Writer.t.
This serializes the DOM back to HTML. The output is valid HTML5 that can be parsed to produce an equivalent DOM tree.
Example:
open Bytesrw
let html =
  let buf = Buffer.create 1024 in
  let writer = Bytes.Writer.of_buffer buf in
  Html5rw.to_writer result writer;
  (* Signal end of data so the writer flushes everything to buf. *)
  Bytes.Writer.write_eod writer;
  Buffer.contents buf

val to_string : ?pretty:bool -> ?indent_size:int -> t -> string
Serialize the DOM tree to a string.
Convenience function that serializes to a string instead of a writer. For large documents, prefer to_writer, which can stream the output instead of building the whole string in memory.
val to_text : ?separator:string -> ?strip:bool -> t -> string
Extract text content from the DOM tree.
This concatenates all text nodes in the document, producing a string with just the readable text (no HTML tags).
Example:
(* For document: <div><p>Hello</p><p>World</p></div> *)
let text = to_text result
(* Returns: "Hello World" *)

val to_test_format : t -> string
Serialize to html5lib test format.
This produces the tree format used by the html5lib-tests suite. Mainly useful for testing the parser against the reference tests.
Get the root node of the parsed document.
For full document parsing, this returns a Document node. The structure is:
#document
├── !doctype (if present)
└── html
    ├── head
    └── body

For fragment parsing, this returns a Document Fragment node containing the parsed elements directly.
val errors : t -> parse_error list
Get parse errors (if error collection was enabled).
Returns an empty list if ~collect_errors:true was not passed to the parse function, or if the document was well-formed.
Errors are returned in the order they were encountered during parsing.
Get the detected encoding (if parsed from bytes).
Returns Some encoding when parse_bytes was used, indicating which encoding was detected or specified. Returns None when parse was used, since it expects pre-decoded UTF-8 input.
Common DOM operations are available directly on this module. For the full API including more advanced operations, see the Dom module.
val create_element :
  string ->
  ?namespace:string option ->
  ?attrs:(string * string) list ->
  unit ->
  node
Create an element node.
Elements are the building blocks of HTML documents. They represent tags like <div>, <p>, <a>, etc.
Example:
(* Simple element *)
let div = create_element "div" ()
(* Element with attributes *)
let link = create_element "a"
~attrs:[("href", "/about"); ("class", "nav-link")]
  ()

val create_text : string -> node
Create a text node.
Text nodes contain the readable text content of HTML documents.
Example:
let text = create_text "Hello, world!"

val create_comment : string -> node
Create a comment node.
Comments are preserved in the DOM but not rendered. They're written as <!-- text --> in HTML.
val create_document : unit -> node
Create an empty document node.
The Document node is the root of an HTML document tree.
val create_document_fragment : unit -> node
Create a document fragment node.
Document fragments are lightweight containers for holding nodes without a parent document. Used for template contents and fragment parsing results.
val create_doctype :
  ?name:string ->
  ?public_id:string ->
  ?system_id:string ->
  unit ->
  node
Create a doctype node.
For HTML5 documents, use create_doctype ~name:"html" ().
Append a child node to a parent.
The child is added as the last child of the parent. If the child already has a parent, it is first removed from that parent.
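The "move, don't copy" behaviour described above can be sketched with a tiny mutable tree. The record type and both functions below are hypothetical stand-ins written for this example; they only mirror the documented semantics and are not the library's implementation.

```ocaml
(* Hypothetical mutable node with a parent pointer;
   not the library's node type. *)
type node = {
  name : string;
  mutable parent : node option;
  mutable children : node list;
}

let make name = { name; parent = None; children = [] }

(* Detach [child] from its current parent, if it has one.
   Physical inequality (!=) identifies the node itself. *)
let detach child =
  match child.parent with
  | None -> ()
  | Some p ->
    p.children <- List.filter (fun c -> c != child) p.children;
    child.parent <- None

(* Append [child] as the last child of [parent], moving it if needed. *)
let append_child parent child =
  detach child;
  parent.children <- parent.children @ [ child ];
  child.parent <- Some parent

let () =
  let ul1 = make "ul" and ul2 = make "ol" and li = make "li" in
  append_child ul1 li;
  append_child ul2 li;  (* moves li out of ul1 *)
  (* prints "0 1" *)
  Printf.printf "%d %d\n" (List.length ul1.children) (List.length ul2.children)
```

Because the node is moved, it never appears under two parents at once.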
Insert a node before a reference node.
Raises Not_found if ref_child is not a child of parent.
Remove a child node from its parent.
Raises Not_found if child is not a child of parent.
val get_attr : node -> string -> string option
Get an attribute value.
Returns Some value if the attribute exists, None otherwise. Attribute names are case-sensitive (but were lowercased during parsing).
val set_attr : node -> string -> string -> unit
Set an attribute value.
If the attribute exists, it is replaced. If not, it is added.
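The lookup and replace-or-add behaviour of get_attr and set_attr can be illustrated on a plain association list, the same shape as the ~attrs argument of create_element. The functions lookup and update are hypothetical; unlike set_attr, which updates the node in place, update returns a new list.

```ocaml
(* Attributes modelled as an association list of (name, value). *)
let lookup attrs name = List.assoc_opt name attrs

(* Replace the value if [name] is present, otherwise add it at the end. *)
let update attrs name value =
  if List.mem_assoc name attrs then
    List.map (fun (k, v) -> if k = name then (k, value) else (k, v)) attrs
  else attrs @ [ (name, value) ]

let () =
  let attrs = [ ("href", "/about"); ("class", "nav-link") ] in
  let attrs = update attrs "href" "/contact" in
  let attrs = update attrs "title" "Contact" in
  List.iter (fun (k, v) -> Printf.printf "%s=%s\n" k v) attrs
```

Lookups are case-sensitive, so "HREF" would not find an attribute stored as "href".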
val has_attr : node -> string -> bool
Check if a node has an attribute.
Get all descendant nodes in document order.
Returns all nodes below this node in the tree, in the order they appear in the HTML source (depth-first).
Get all ancestor nodes from parent to root.
Returns the chain of parent nodes, starting with the immediate parent and ending with the Document node.
val get_text_content : node -> string
Get text content of a node and its descendants.
For text nodes, returns the text directly. For elements, recursively concatenates all descendant text content.
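The recursive rule above can be sketched as follows. The node type here is a hypothetical, simplified stand-in for the library's nodes, and treating comments as contributing no text is an assumption of this sketch.

```ocaml
(* Hypothetical, simplified node type; not the library's Dom.node. *)
type node =
  | Element of string * node list
  | Text of string
  | Comment of string

(* Text nodes yield their text; elements concatenate the text of all
   descendants in document order. Comments are assumed to yield "". *)
let rec text_content = function
  | Text s -> s
  | Comment _ -> ""
  | Element (_, children) ->
    String.concat "" (List.map text_content children)

let () =
  let p =
    Element ("p", [ Text "Hello, "; Element ("b", [ Text "world" ]); Text "!" ])
  in
  (* prints "Hello, world!" *)
  print_endline (text_content p)
```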
Functions to test what type of node you have.
val is_element : node -> bool
Test if a node is an element.
Elements are HTML tags like <div>, <p>, <a>.
val is_text : node -> bool
Test if a node is a text node.
Text nodes contain character content within elements.
val is_comment : node -> bool
Test if a node is a comment node.
Comment nodes represent HTML comments <!-- ... -->.
val is_document : node -> bool
Test if a node is a document node.
The document node is the root of a complete HTML document tree.
val is_document_fragment : node -> bool
Test if a node is a document fragment.
Document fragments are lightweight containers for nodes.
val is_doctype : node -> bool
Test if a node is a doctype node.
Doctype nodes represent the <!DOCTYPE> declaration.
val has_children : node -> bool
Test if a node has children.