Module Html5rw

Html5rw - Pure OCaml HTML5 Parser

This library provides a complete HTML5 parsing solution that implements the WHATWG HTML5 parsing specification. It can parse any HTML document - well-formed or not - and produce a DOM (Document Object Model) tree that matches browser behavior.

What is HTML?

HTML (HyperText Markup Language) is the standard markup language for creating web pages. An HTML document consists of nested elements that describe the structure and content of the page:

<!DOCTYPE html>
<html>
  <head>
    <title>My Page</title>
  </head>
  <body>
    <h1>Welcome</h1>
    <p>Hello, <b>world</b>!</p>
  </body>
</html>

Most elements are written with a start tag (like <p>), content, and an end tag (like </p>); void elements such as <br> have only a start tag. Elements can have attributes that provide additional information: <a href="https://example.com">.

The DOM

When this parser processes HTML, it doesn't just store the text. Instead, it builds a tree structure called the DOM (Document Object Model). Each element, text fragment, and comment becomes a node in this tree:

Document
└── html
    ├── head
    │   └── title
    │       └── #text "My Page"
    └── body
        ├── h1
        │   └── #text "Welcome"
        └── p
            ├── #text "Hello, "
            ├── b
            │   └── #text "world"
            └── #text "!"

This tree can be traversed, searched, and modified. The Dom module provides types and functions for working with DOM nodes.
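
To make the tree shape concrete, here is a minimal sketch that models the tree above with a toy variant type. The toy type, the text_of function, and the page value are hypothetical names introduced only for this illustration; they are not the library's Dom.node representation.

```ocaml
(* Toy model of the tree above; NOT the library's Dom.node type. *)
type toy =
  | Element of string * toy list  (* tag name, children *)
  | Text of string

(* Concatenate every text node, depth-first, in document order. *)
let rec text_of = function
  | Text s -> s
  | Element (_, children) -> String.concat "" (List.map text_of children)

let page =
  Element ("html", [
    Element ("head", [ Element ("title", [ Text "My Page" ]) ]);
    Element ("body", [
      Element ("h1", [ Text "Welcome" ]);
      Element ("p", [ Text "Hello, "; Element ("b", [ Text "world" ]); Text "!" ]);
    ]);
  ])

(* Pick out the <p> element (second child of <body>) and read its text. *)
let paragraph =
  match page with
  | Element (_, [ _; Element (_, [ _; p ]) ]) -> text_of p
  | _ -> ""
```

Note how the text of <p> is spread over three text nodes, one of them nested inside <b>; reading an element's text means walking its whole subtree.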

Quick Start

Parse HTML from a string:

  open Bytesrw
  let reader = Bytes.Reader.of_string "<p>Hello, world!</p>"
  let result = Html5rw.parse reader
  let html = Html5rw.to_string result

Parse from a file:

  open Bytesrw
  let result =
    let ic = open_in_bin "page.html" in
    let reader = Bytes.Reader.of_in_channel ic in
    let result = Html5rw.parse reader in
    close_in ic;
    result

Query with CSS selectors:

  let result = Html5rw.parse reader
  let divs = Html5rw.query result "div.content"

Error Handling

Unlike many parsers, an HTML5 parser never fails. The WHATWG specification defines error recovery rules for every possible malformed input, ensuring all HTML documents produce a valid DOM tree (just as browsers do).

For example, parsing <p>Hello<p>World produces two paragraphs, not an error, because <p> implicitly closes the previous <p>.

If you need to detect malformed HTML (e.g., for validation), enable error collection by passing ~collect_errors:true to parse or parse_bytes. Errors are advisory - parsing still succeeds.

HTML vs XHTML

This parser implements HTML5 parsing, not XHTML parsing. XHTML uses stricter XML rules; if you need XHTML parsing, use an XML parser.

Sub-modules

module Dom : sig ... end

DOM types and manipulation functions.

module Tokenizer : sig ... end

HTML5 tokenizer.

module Encoding : sig ... end

Encoding detection and decoding.

module Selector : sig ... end

CSS selector engine.

module Entities : sig ... end

HTML entity decoding.

module Parser : sig ... end

Low-level parser access.

Core Types

type node = Dom.node

DOM node type.

A node represents one part of an HTML document. Nodes form a tree structure with parent/child relationships. There are several kinds:

  • Element nodes: HTML tags like <div>, <p>, <a>
  • Text nodes: Text content within elements
  • Comment nodes: HTML comments <!-- ... -->
  • Document nodes: The root of a document tree
  • Document fragment nodes: Lightweight containers
  • Doctype nodes: The <!DOCTYPE html> declaration

See Dom for manipulation functions.

type doctype_data = Dom.doctype_data = {
  1. name : string option;
    (*

    DOCTYPE name, typically "html"

    *)
  2. public_id : string option;
    (*

    Public identifier for legacy DOCTYPEs (e.g., XHTML, HTML4)

    *)
  3. system_id : string option;
    (*

    System identifier (URL) for legacy DOCTYPEs

    *)
}

DOCTYPE information.

The DOCTYPE declaration (<!DOCTYPE html>) appears at the start of HTML documents. It tells browsers to use standards mode for rendering.

In HTML5, the DOCTYPE is minimal - just <!DOCTYPE html> with no public or system identifiers. Legacy DOCTYPEs may have additional fields.
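
As an illustration of how the three fields relate to the declaration syntax, the following sketch renders a record back into a DOCTYPE string. The record is a local copy of the shape documented above so that the snippet is self-contained, and doctype_to_string is a hypothetical helper, not part of the library.

```ocaml
(* Local copy of the documented record shape, for a self-contained sketch;
   real code would use Html5rw.doctype_data directly. *)
type doctype_data = {
  name : string option;
  public_id : string option;
  system_id : string option;
}

(* Render a DOCTYPE declaration (hypothetical helper, not a library function). *)
let doctype_to_string d =
  let name = Option.value d.name ~default:"" in
  match d.public_id, d.system_id with
  | None, None -> Printf.sprintf "<!DOCTYPE %s>" name
  | Some p, None -> Printf.sprintf "<!DOCTYPE %s PUBLIC \"%s\">" name p
  | None, Some s -> Printf.sprintf "<!DOCTYPE %s SYSTEM \"%s\">" name s
  | Some p, Some s -> Printf.sprintf "<!DOCTYPE %s PUBLIC \"%s\" \"%s\">" name p s

(* The minimal HTML5 form: no identifiers. *)
let html5 = { name = Some "html"; public_id = None; system_id = None }

(* A legacy HTML 4.01 Strict DOCTYPE carries both identifiers. *)
let html4 = {
  name = Some "html";
  public_id = Some "-//W3C//DTD HTML 4.01//EN";
  system_id = Some "http://www.w3.org/TR/html4/strict.dtd";
}
```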

type quirks_mode = Dom.quirks_mode =
  1. | No_quirks
  2. | Quirks
  3. | Limited_quirks

Quirks mode as determined during parsing.

Quirks mode controls how browsers render CSS and compute layouts. It exists for backwards compatibility with old web pages that relied on browser bugs.

  • No_quirks: Standards mode. The document is rendered according to modern HTML5 and CSS specifications. Triggered by <!DOCTYPE html>.
  • Quirks: Full quirks mode. The browser emulates bugs from older browsers (primarily IE5). Triggered by missing or malformed DOCTYPEs. Affects CSS box model, table layout, font inheritance, and more.
  • Limited_quirks: Almost standards mode. Only a few specific quirks are applied, mainly affecting table cell vertical alignment.

Recommendation: Always use <!DOCTYPE html> to ensure standards mode.

type encoding = Encoding.encoding =
  1. | Utf8
    (*

    UTF-8: The dominant encoding for the web, supporting all Unicode

    *)
  2. | Utf16le
    (*

    UTF-16 Little-Endian: 16-bit encoding, used by Windows

    *)
  3. | Utf16be
    (*

    UTF-16 Big-Endian: 16-bit encoding, network byte order

    *)
  4. | Windows_1252
    (*

    Windows-1252 (CP-1252): Western European, superset of ISO-8859-1

    *)
  5. | Iso_8859_2
    (*

    ISO-8859-2: Central European (Polish, Czech, Hungarian, etc.)

    *)
  6. | Euc_jp
    (*

    EUC-JP: Extended Unix Code for Japanese

    *)

Character encoding detected or specified.

HTML documents are sequences of bytes that must be decoded into characters. Different encodings interpret the same bytes differently. For example:

  • UTF-8: The modern standard, supporting all Unicode characters
  • Windows-1252: Common on older Western European web pages
  • ISO-8859-2: Used for Central European languages
  • UTF-16: Used by some Windows applications

The parser detects encoding automatically when using parse_bytes. The detected encoding is available via encoding.
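
To see why the encoding matters, the sketch below takes the bytes of "café" as a Windows-1252 page would store them and shows that they are not valid UTF-8 until they are transcoded. It uses only the standard library (String.is_valid_utf_8 needs OCaml 4.14 or later); latin1_to_utf8 is a hypothetical helper and says nothing about how the Encoding module is implemented.

```ocaml
(* Decode bytes as ISO-8859-1 and re-encode them as UTF-8. Windows-1252
   agrees with ISO-8859-1 except for bytes 0x80-0x9F, which this sketch
   does not handle. *)
let latin1_to_utf8 s =
  let buf = Buffer.create (String.length s) in
  String.iter
    (fun c -> Buffer.add_utf_8_uchar buf (Uchar.of_int (Char.code c)))
    s;
  Buffer.contents buf

(* "café" as Windows-1252 bytes: the final byte 0xE9 is "é". *)
let raw = "caf\xE9"

(* Read as UTF-8, 0xE9 starts a multi-byte sequence that never completes. *)
let valid_as_utf8 = String.is_valid_utf_8 raw

(* After transcoding, "é" becomes the two bytes 0xC3 0xA9. *)
let decoded = latin1_to_utf8 raw
```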

type parse_error = Parser.parse_error

A parse error encountered during HTML5 parsing.

HTML5 parsing never fails - the specification defines error recovery for all malformed input. However, conformance checkers can report these errors. Enable error collection with ~collect_errors:true if you want to detect malformed HTML.

Common parse errors:

  • "unexpected-null-character": Null byte in the input
  • "eof-before-tag-name": File ended while reading a tag
  • "unexpected-character-in-attribute-name": Invalid attribute syntax
  • "missing-doctype": Document started without <!DOCTYPE>
  • "duplicate-attribute": Same attribute appears twice on an element

The full list of parse error codes is defined in the WHATWG specification.

val error_code : parse_error -> string

Get the error code string.

Error codes are lowercase with hyphens, matching the WHATWG specification names. Examples: "unexpected-null-character", "eof-in-tag", "missing-end-tag-name".

val error_line : parse_error -> int

Get the line number where the error occurred (1-indexed).

Line numbers count from 1 and increment at each newline character.

val error_column : parse_error -> int

Get the column number where the error occurred (1-indexed).

Column numbers count from 1 and reset at each newline.
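
The counting rules for lines and columns can be sketched with the standard library alone. position_of is a hypothetical helper that maps a 0-based byte offset to a 1-indexed (line, column) pair; whether the library counts columns in bytes or in characters, and how it treats CR LF, is not stated here, so treat those details as assumptions.

```ocaml
(* Convert a 0-based byte offset into a 1-indexed (line, column) pair.
   Assumption: columns are counted in bytes and only '\n' ends a line. *)
let position_of (src : string) (offset : int) : int * int =
  let line = ref 1 and col = ref 1 in
  for i = 0 to min offset (String.length src) - 1 do
    if src.[i] = '\n' then (incr line; col := 1) else incr col
  done;
  (!line, !col)

(* The 'd' in "ab\ncd" sits at offset 4: second line, second column. *)
let pos_of_d = position_of "ab\ncd" 4
```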

type fragment_context = Parser.fragment_context

Context element for HTML fragment parsing (innerHTML).

When parsing HTML fragments (like the innerHTML of an element), you must specify what element would contain the fragment. This affects how the parser handles certain elements.

Why context matters:

HTML parsing rules depend on where content appears. For example:

  • <td> is valid inside <tr> but not inside <div>
  • <li> is valid inside <ul> but creates implied lists elsewhere
  • Content inside <table> has special parsing rules

Example:

  (* Parse as if content were inside a <ul> *)
  let ctx = make_fragment_context ~tag_name:"ul" ()
  let result = parse ~fragment_context:ctx reader
  (* Now <li> elements are parsed correctly *)

val make_fragment_context : tag_name:string -> ?namespace:string option -> unit -> fragment_context

Create a fragment parsing context.

The context element determines how the parser interprets the fragment. Choose a context that matches where the fragment would be inserted.

  • parameter tag_name

    Tag name of the context element (e.g., "div", "tr", "ul"). This is the element that would contain the fragment.

  • parameter namespace

    Namespace of the context element:

    • None (default): HTML namespace
    • Some "svg": SVG namespace
    • Some "mathml": MathML namespace

Examples:

  (* Parse as innerHTML of a <div> (most common case) *)
  let ctx = make_fragment_context ~tag_name:"div" ()

  (* Parse as innerHTML of a <ul> - <li> elements work correctly *)
  let ctx = make_fragment_context ~tag_name:"ul" ()

  (* Parse as innerHTML of an SVG <g> element *)
  let ctx = make_fragment_context ~tag_name:"g" ~namespace:(Some "svg") ()

  (* Parse as innerHTML of a <table> - table-specific rules apply *)
  let ctx = make_fragment_context ~tag_name:"table" ()

val fragment_context_tag : fragment_context -> string

Get the tag name of a fragment context.

val fragment_context_namespace : fragment_context -> string option

Get the namespace of a fragment context.

type t = {
  1. root : node;
    (*

    Root node of the parsed document tree.

    For full document parsing, this is a Document node containing the DOCTYPE (if any) and <html> element.

    For fragment parsing, this is a Document Fragment containing the parsed elements.

    *)
  2. errors : parse_error list;
    (*

    Parse errors encountered during parsing.

    This list is empty unless ~collect_errors:true was passed to the parse function. Errors are in the order they were encountered.

    *)
  3. encoding : encoding option;
    (*

    Character encoding detected during parsing.

    This is Some encoding when using parse_bytes with automatic encoding detection, and None when using parse (which expects pre-decoded UTF-8 input).

    *)
}

Result of parsing an HTML document.

This record contains everything produced by parsing:

  • The DOM tree (accessible via root)
  • Any parse errors (accessible via errors)
  • The detected encoding (accessible via encoding)

Parsing Functions

val parse : ?collect_errors:bool -> ?fragment_context:fragment_context -> Bytesrw.Bytes.Reader.t -> t

Parse HTML from a Bytes.Reader.t.

This is the primary parsing function. It reads bytes from the provided reader and returns a parse result (t) whose root field holds the DOM tree. The input should be valid UTF-8.

Creating readers:

  open Bytesrw

  (* From a string *)
  let reader = Bytes.Reader.of_string html_string

  (* From a file *)
  let ic = open_in "page.html"
  let reader = Bytes.Reader.of_in_channel ic

  (* From the contents of a Buffer.t *)
  let reader = Bytes.Reader.of_string (Buffer.contents buf)

Parsing a complete document:

  let result = Html5rw.parse reader
  let doc = Html5rw.root result

Parsing a fragment:

  let ctx = Html5rw.make_fragment_context ~tag_name:"div" ()
  let result = Html5rw.parse ~fragment_context:ctx reader

  • parameter collect_errors

    If true, collect parse errors. Default: false. Error collection has some performance overhead.

  • parameter fragment_context

    Context element for fragment parsing. If provided, the input is parsed as a fragment (like innerHTML) rather than a complete document.

val parse_bytes : ?collect_errors:bool -> ?transport_encoding:string -> ?fragment_context:fragment_context -> bytes -> t

Parse raw bytes with automatic encoding detection.

This function is useful when you have raw bytes and don't know the character encoding. It implements the WHATWG encoding sniffing algorithm:

  1. BOM detection: Check for UTF-8, UTF-16LE, or UTF-16BE BOM
  2. Prescan: Look for <meta charset="..."> in the first 1024 bytes
  3. Transport hint: Use the provided transport_encoding if any
  4. Fallback: Use UTF-8 (the modern web default)

The detected encoding is stored in the result's encoding field.
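
The BOM check in the first step can be sketched with the standard library. The byte order marks are EF BB BF for UTF-8, FF FE for UTF-16LE, and FE FF for UTF-16BE. The bom type and detect_bom function are hypothetical names for this illustration and do not describe the library's internals.

```ocaml
(* Encodings that a byte order mark can signal (toy type for this sketch). *)
type bom = Bom_utf8 | Bom_utf16le | Bom_utf16be

(* Return the encoding indicated by a leading byte order mark, if any. *)
let detect_bom (b : bytes) : bom option =
  let byte i = Char.code (Bytes.get b i) in
  let n = Bytes.length b in
  if n >= 3 && byte 0 = 0xEF && byte 1 = 0xBB && byte 2 = 0xBF then
    Some Bom_utf8
  else if n >= 2 && byte 0 = 0xFF && byte 1 = 0xFE then Some Bom_utf16le
  else if n >= 2 && byte 0 = 0xFE && byte 1 = 0xFF then Some Bom_utf16be
  else None
```

When no BOM is present the function returns None, which corresponds to moving on to the next step of the list above.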

Example:

  let () =
    (* Read the whole file as raw bytes *)
    let ic = open_in_bin "page.html" in
    let bytes = Bytes.of_string (really_input_string ic (in_channel_length ic)) in
    close_in ic;
    let result = Html5rw.parse_bytes bytes in
    match Html5rw.encoding result with
    | Some Html5rw.Utf8 -> print_endline "UTF-8 detected"
    | Some Html5rw.Windows_1252 -> print_endline "Windows-1252 detected"
    | _ -> ()

  • parameter collect_errors

    If true, collect parse errors. Default: false.

  • parameter transport_encoding

    Encoding hint from HTTP Content-Type header. For example, if the server sends Content-Type: text/html; charset=utf-8, pass ~transport_encoding:"utf-8".

  • parameter fragment_context

    Context element for fragment parsing.

Querying

val query : t -> string -> node list

Query the DOM tree with a CSS selector.

CSS selectors are patterns used to select elements in HTML documents. This function returns all nodes matching the selector, in document order.

Supported selectors:

Type selectors:

  • div, p, span - elements by tag name

Class and ID selectors:

  • #myid - element with id="myid"
  • .myclass - elements with class containing "myclass"

Attribute selectors:

  • [attr] - elements with the attr attribute
  • [attr="value"] - attribute equals value
  • [attr~="value"] - attribute contains word
  • [attr|="value"] - attribute starts with value or value-
  • [attr^="value"] - attribute starts with value
  • [attr$="value"] - attribute ends with value
  • [attr*="value"] - attribute contains value
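
The attribute operators listed above differ only in how they compare the selector value with the actual attribute value. A minimal sketch of those comparisons, using only the standard library (OCaml 4.13 or later for String.starts_with), could look like this. attr_matches and contains are hypothetical helpers, not the Selector module's implementation, and the word test splits on spaces only, whereas CSS treats any whitespace as a separator.

```ocaml
(* Does [s] contain [sub] as a substring? *)
let contains s sub =
  let n = String.length s and m = String.length sub in
  let rec go i = i + m <= n && (String.sub s i m = sub || go (i + 1)) in
  go 0

(* [attr_matches op ~value actual] tests the attribute value [actual]
   against the operator [op] and selector value from [attr OP "value"]. *)
let attr_matches op ~value actual =
  match op with
  | "=" -> actual = value
  | "~=" ->
    (* One of the space-separated words equals [value]. *)
    value <> "" && List.mem value (String.split_on_char ' ' actual)
  | "|=" ->
    (* Exactly [value], or [value] followed by a hyphen. *)
    actual = value || String.starts_with ~prefix:(value ^ "-") actual
  | "^=" -> value <> "" && String.starts_with ~prefix:value actual
  | "$=" -> value <> "" && String.ends_with ~suffix:value actual
  | "*=" -> value <> "" && contains actual value
  | _ -> invalid_arg "attr_matches: unknown operator"
```

For instance, [class~="btn"] matches class="btn primary" but not class="btn-primary", while [lang|="en"] matches both "en" and "en-US".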

Pseudo-classes:

  • :first-child, :last-child - first/last child of parent
  • :nth-child(n) - nth child (1-indexed)
  • :only-child - only child of parent
  • :empty - elements with no children
  • :not(selector) - elements not matching selector

Combinators:

  • A B - B descendants of A (any depth)
  • A > B - B direct children of A
  • A + B - B immediately after A (adjacent sibling)
  • A ~ B - B after A (general sibling)

Universal:

  • * - all elements

Examples:

  (* All paragraphs *)
  let ps = query result "p"

  (* Elements with class "warning" inside a div *)
  let warnings = query result "div .warning"

  (* Direct children of nav that are links *)
  let nav_links = query result "nav > a"

  (* Complex selector *)
  let items = query result "ul.menu > li:first-child a[href]"

val matches : node -> string -> bool

Check if a node matches a CSS selector.

This is useful for filtering nodes or implementing custom traversals.

Example:

  let is_external_link node =
    matches node "a[href^='http']"

Serialization

val to_writer : ?pretty:bool -> ?indent_size:int -> t -> Bytesrw.Bytes.Writer.t -> unit

Write the DOM tree to a Bytes.Writer.t.

This serializes the DOM back to HTML. The output is valid HTML5 that can be parsed to produce an equivalent DOM tree.

Example:

  open Bytesrw
  let html =
    let buf = Buffer.create 1024 in
    let writer = Bytes.Writer.of_buffer buf in
    Html5rw.to_writer result writer;
    Bytes.Writer.write_eod writer;
    Buffer.contents buf

  • parameter pretty

    If true (default), add indentation for readability. If false, output compact HTML with no added whitespace.

  • parameter indent_size

    Spaces per indentation level (default: 2). Only used when pretty is true.

val to_string : ?pretty:bool -> ?indent_size:int -> t -> string

Serialize the DOM tree to a string.

Convenience function that serializes to a string instead of a writer. For large documents, prefer to_writer so the whole output does not have to be held in memory as a single string.

  • parameter pretty

    If true (default), add indentation for readability.

  • parameter indent_size

    Spaces per indentation level (default: 2).

val to_text : ?separator:string -> ?strip:bool -> t -> string

Extract text content from the DOM tree.

This concatenates all text nodes in the document, producing a string with just the readable text (no HTML tags).

Example:

  (* For document: <div><p>Hello</p><p>World</p></div> *)
  let text = to_text result
  (* Returns: "Hello World" *)

  • parameter separator

    String to insert between text nodes (default: " ")

  • parameter strip

    If true (default), trim leading/trailing whitespace

val to_test_format : t -> string

Serialize to html5lib test format.

This produces the tree format used by the html5lib-tests suite. Mainly useful for testing the parser against the reference tests.

Result Accessors

val root : t -> node

Get the root node of the parsed document.

For full document parsing, this returns a Document node. The structure is:

#document
├── !doctype (if present)
└── html
    ├── head
    └── body

For fragment parsing, this returns a Document Fragment node containing the parsed elements directly.

val errors : t -> parse_error list

Get parse errors (if error collection was enabled).

Returns an empty list if ~collect_errors:true was not passed to the parse function, or if the document was well-formed.

Errors are returned in the order they were encountered during parsing.

val encoding : t -> encoding option

Get the detected encoding (if parsed from bytes).

Returns Some encoding when parse_bytes was used, indicating which encoding was detected or specified. Returns None when parse was used, since it expects pre-decoded UTF-8 input.

DOM Utilities

Common DOM operations are available directly on this module. For the full API including more advanced operations, see the Dom module.

val create_element : string -> ?namespace:string option -> ?attrs:(string * string) list -> unit -> node

Create an element node.

Elements are the building blocks of HTML documents. They represent tags like <div>, <p>, <a>, etc.

  • parameter name

    Tag name (e.g., "div", "p", "span")

  • parameter namespace

    Element namespace:

    • None (default): HTML namespace
    • Some "svg": SVG namespace for graphics
    • Some "mathml": MathML namespace for math notation

  • parameter attrs

    Initial attributes as (name, value) pairs

Example:

  (* Simple element *)
  let div = create_element "div" ()

  (* Element with attributes *)
  let link = create_element "a"
    ~attrs:[("href", "/about"); ("class", "nav-link")]
    ()

val create_text : string -> node

Create a text node.

Text nodes contain the readable text content of HTML documents.

Example:

  let text = create_text "Hello, world!"

val create_comment : string -> node

Create a comment node.

Comments are preserved in the DOM but not rendered. They're written as <!-- text --> in HTML.

val create_document : unit -> node

Create an empty document node.

The Document node is the root of an HTML document tree.

val create_document_fragment : unit -> node

Create a document fragment node.

Document fragments are lightweight containers for holding nodes without a parent document. Used for template contents and fragment parsing results.

val create_doctype : ?name:string -> ?public_id:string -> ?system_id:string -> unit -> node

Create a doctype node.

For HTML5 documents, use create_doctype ~name:"html" ().

  • parameter name

    DOCTYPE name (usually "html")

  • parameter public_id

    Public identifier (legacy)

  • parameter system_id

    System identifier (legacy)

val append_child : node -> node -> unit

Append a child node to a parent.

The child is added as the last child of the parent. If the child already has a parent, it is first removed from that parent.
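
The "move, don't copy" behaviour described above can be modelled with a toy mutable tree. The toy record and the make, detach, and append_child functions below are hypothetical stand-ins written for this sketch; they only mirror the documented semantics and are not the library's Dom implementation.

```ocaml
(* Toy mutable tree; NOT the library's Dom.node. *)
type toy = {
  name : string;
  mutable parent : toy option;
  mutable children : toy list;
}

let make name = { name; parent = None; children = [] }

(* Detach [child] from its current parent, if it has one. *)
let detach child =
  match child.parent with
  | None -> ()
  | Some p ->
    (* Physical inequality: remove this exact node, not look-alikes. *)
    p.children <- List.filter (fun c -> c != child) p.children;
    child.parent <- None

(* Append [child] as the last child of [parent], moving it if needed. *)
let append_child parent child =
  detach child;
  parent.children <- parent.children @ [ child ];
  child.parent <- Some parent

let ul1 = make "ul" and ul2 = make "ul" and li = make "li"

(* Appending the same node twice moves it: it ends up only under ul2. *)
let () = append_child ul1 li; append_child ul2 li
```

Because a node has a single parent, appending it elsewhere never leaves a second copy behind; use clone first if both locations need the content.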

val insert_before : node -> node -> node -> unit

Insert a node before a reference node.

  • parameter parent

    The parent node

  • parameter new_child

    The node to insert

  • parameter ref_child

    The existing child to insert before

Raises Not_found if ref_child is not a child of parent.

val remove_child : node -> node -> unit

Remove a child node from its parent.

Raises Not_found if child is not a child of parent.

val get_attr : node -> string -> string option

Get an attribute value.

Returns Some value if the attribute exists, None otherwise. Lookup is case-sensitive; because attribute names are lowercased during parsing, use lowercase names when querying parsed documents.

val set_attr : node -> string -> string -> unit

Set an attribute value.

If the attribute exists, it is replaced. If not, it is added.

val has_attr : node -> string -> bool

Check if a node has an attribute.

val descendants : node -> node list

Get all descendant nodes in document order.

Returns all nodes below this node in the tree, in the order they appear in the HTML source (depth-first, pre-order: each node comes before its own children).
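
Document order corresponds to a depth-first pre-order walk. The following sketch shows that order on a toy tree; the toy type and this local descendants function are hypothetical and only model the traversal, not the library's node type.

```ocaml
(* Toy tree: a name and a list of children. NOT the library's Dom.node. *)
type toy = Node of string * toy list

(* All nodes strictly below [n], depth-first pre-order. *)
let rec descendants (Node (_, children)) =
  List.concat_map (fun c -> c :: descendants c) children

let name (Node (s, _)) = s

let tree =
  Node ("body", [
    Node ("h1", [ Node ("#text", []) ]);
    Node ("p", [ Node ("b", []) ]);
  ])

(* Each node is listed before its own children. *)
let order = List.map name (descendants tree)
```

Here order lists h1, then its text node, then p, then b; the starting node itself is not included.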

val ancestors : node -> node list

Get all ancestor nodes from parent to root.

Returns the chain of parent nodes, starting with the immediate parent and ending with the root of the tree (the Document node for a parsed document).
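
The parent-to-root ordering can be sketched by following parent links on a toy record. The toy type and this local ancestors function are hypothetical names used for illustration only.

```ocaml
(* Toy node with a parent link; NOT the library's Dom.node. *)
type toy = { name : string; parent : toy option }

(* Walk parent links from the immediate parent up to the root. *)
let rec ancestors n =
  match n.parent with
  | None -> []
  | Some p -> p :: ancestors p

let doc = { name = "#document"; parent = None }
let html = { name = "html"; parent = Some doc }
let body = { name = "body"; parent = Some html }

(* Nearest ancestor first, root last. *)
let chain = List.map (fun n -> n.name) (ancestors body)
```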

val get_text_content : node -> string

Get text content of a node and its descendants.

For text nodes, returns the text directly. For elements, recursively concatenates all descendant text content.

val clone : ?deep:bool -> node -> node

Clone a node.

  • parameter deep

    If true, recursively clone all descendants. If false (default), only clone the node itself.
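
The difference between a shallow and a deep clone can be modelled on a toy tree. This is a sketch of the documented semantics only; the toy record and the local clone function are hypothetical and ignore attributes and parent links.

```ocaml
(* Toy node: a tag and its children. NOT the library's Dom.node. *)
type toy = { tag : string; mutable children : toy list }

(* Shallow clone copies only the node; deep clone copies the subtree. *)
let rec clone ?(deep = false) n =
  let children =
    if deep then List.map (fun c -> clone ~deep:true c) n.children else []
  in
  { tag = n.tag; children }

let ul = { tag = "ul"; children = [ { tag = "li"; children = [] } ] }

let shallow = clone ul            (* a <ul> with no children *)
let deep = clone ~deep:true ul    (* a <ul> with its own copy of <li> *)
```

In both cases the result is a fresh node, so modifying the clone leaves the original untouched.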

Node Predicates

Functions to test what type of node you have.

val is_element : node -> bool

Test if a node is an element.

Elements are HTML tags like <div>, <p>, <a>.

val is_text : node -> bool

Test if a node is a text node.

Text nodes contain character content within elements.

val is_comment : node -> bool

Test if a node is a comment node.

Comment nodes represent HTML comments <!-- ... -->.

val is_document : node -> bool

Test if a node is a document node.

The document node is the root of a complete HTML document tree.

val is_document_fragment : node -> bool

Test if a node is a document fragment.

Document fragments are lightweight containers for nodes.

val is_doctype : node -> bool

Test if a node is a doctype node.

Doctype nodes represent the <!DOCTYPE> declaration.

val has_children : node -> bool

Test if a node has children.