Module Html5rw.Tokenizer

HTML5 tokenizer.

The tokenizer is the first stage of HTML5 parsing. It converts a stream of characters into a stream of tokens: start tags, end tags, text, comments, and DOCTYPEs.

Most users do not need to call the tokenizer directly, because the parse function runs it as part of parsing. The tokenizer is exposed for advanced use cases such as syntax highlighting or partial parsing.

HTML5 Tokenizer

This module implements the WHATWG HTML5 tokenization algorithm. It reads an input byte stream and produces a sequence of tokens (start tags, end tags, text, comments, DOCTYPEs) for a tree builder to consume.

Sub-modules

module Token : sig ... end

Token types produced by the tokenizer.

module State : sig ... end

Tokenizer states.

module Errors : sig ... end

Parse error types.

module Stream : sig ... end

Input stream with position tracking.

Token Sink Interface

module type SINK = sig ... end

Interface for token consumers.
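The SINK signature is elided on this page, so the following is only a sketch of what a token consumer could look like. The token type, the SINK_SKETCH signature, and the Counting_sink module are hypothetical names introduced for illustration. Only the existence of a process function is taken from the documentation of run, and its real signature may differ.

```ocaml
(* Hypothetical token type: the real Token module is not shown on this page
   and may carry more data (attributes, DOCTYPE identifiers, and so on). *)
type token =
  | Start_tag of string
  | End_tag of string
  | Text of string
  | Comment of string
  | Doctype of string

(* Assumed shape of a sink: a state type plus a [process] callback. *)
module type SINK_SKETCH = sig
  type t
  val process : t -> token -> unit
end

(* A sink that counts start tags and accumulates text content. *)
module Counting_sink = struct
  type t = { mutable start_tags : int; text : Buffer.t }

  let make () = { start_tags = 0; text = Buffer.create 64 }

  let process sink = function
    | Start_tag _ -> sink.start_tags <- sink.start_tags + 1
    | Text s -> Buffer.add_string sink.text s
    | End_tag _ | Comment _ | Doctype _ -> ()

  let start_tags sink = sink.start_tags
  let text sink = Buffer.contents sink.text
end

(* Compile-time check that the module fits the assumed signature. *)
module Checked : SINK_SKETCH with type t = Counting_sink.t = Counting_sink

(* Feed a hand-written token sequence to the sink. *)
let demo =
  let sink = Counting_sink.make () in
  List.iter (Counting_sink.process sink)
    [ Doctype "html"; Start_tag "p"; Text "Hello"; End_tag "p" ];
  sink

let () =
  Printf.printf "%d start tag(s), text %S\n"
    (Counting_sink.start_tags demo) (Counting_sink.text demo)
```

Keeping the sink state in a separate type t lets the same module drive several independent sink values, which matches the way create and run take both a module and a value.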

Tokenizer

type 'sink t

The tokenizer type, parameterized by the sink type.

val create : (module SINK with type t = 'sink) -> 'sink -> ?collect_errors:bool -> ?xml_mode:bool -> unit -> 'sink t

Create a new tokenizer.

  • parameter sink

    The sink value that receives the emitted tokens

  • parameter collect_errors

    If true, collect parse errors (default: false)

  • parameter xml_mode

    If true, apply XML compatibility transformations

val run : 'sink t -> (module SINK with type t = 'sink) -> Bytesrw.Bytes.Reader.t -> unit

Run the tokenizer on the given input.

The tokenizer reads from the reader and calls the sink's process function for each token until EOF is reached.
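The calling pattern of create and run, a first-class module passed alongside a sink value of the matching type, can be sketched without the library. Everything below is a stand-in: the SINK signature is assumed, tokens are simplified to strings, input is a plain string rather than a Bytesrw reader, and the stand-in run merely splits on spaces instead of tokenizing HTML. The xml_mode default of false is also an assumption, since the page does not state it.

```ocaml
(* Assumed sink shape; the real SINK signature is not shown on this page. *)
module type SINK = sig
  type t
  val process : t -> string -> unit (* tokens simplified to strings *)
end

(* Stand-in tokenizer value: it only records its sink and options. *)
type 'sink t = { sink : 'sink; collect_errors : bool; xml_mode : bool }

(* Same argument order as the documented [create]. The module argument is
   ignored here; a real implementation would keep it or use it. *)
let create (type s) (module _ : SINK with type t = s) (sink : s)
    ?(collect_errors = false) ?(xml_mode = false) () : s t =
  { sink; collect_errors; xml_mode }

(* Stand-in for [run]: emits each space-separated word as one "token". *)
let run (type s) (tok : s t) (module S : SINK with type t = s)
    (input : string) : unit =
  String.split_on_char ' ' input
  |> List.filter (fun word -> word <> "")
  |> List.iter (S.process tok.sink)

(* A sink that records every token it receives, most recent first. *)
module Log_sink = struct
  type t = string list ref
  let process log token = log := token :: !log
end

let seen =
  let log = ref [] in
  let tok = create (module Log_sink) log ~collect_errors:true () in
  run tok (module Log_sink) "<p> hello </p>";
  List.rev !log

let () = List.iter print_endline seen
```

The type constraint with type t = 'sink is what ties the module to the sink value, so the compiler rejects a call that pairs a tokenizer with a module for a different sink type.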

val get_errors : 'sink t -> Html5rw__.Tokenizer_errors.t list

Get the list of parse errors encountered during tokenization.

The list is populated only if ~collect_errors:true was passed to create; otherwise it is empty.

val set_state : 'sink t -> Html5rw__.Tokenizer_state.t -> unit

Set the tokenizer state.

The tree builder uses this to switch tokenizer states for raw text elements such as script and style.
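As an illustration of when a tree builder would switch states, the sketch below maps a start tag name to the tokenizer state that the WHATWG tree construction rules select after it. The state variant is a hypothetical stand-in for the elided State module, whose constructor names may differ, and the scripting-dependent noscript case is left out.

```ocaml
(* Hypothetical subset of tokenizer states; not the library's State.t. *)
type state = Data | Rcdata | Rawtext | Script_data | Plaintext

(* State the tokenizer should be in after a start tag for an HTML element
   with the given (lowercase) name. *)
let state_after_start_tag (name : string) : state =
  match name with
  | "title" | "textarea" -> Rcdata
  | "style" | "xmp" | "iframe" | "noembed" | "noframes" -> Rawtext
  | "script" -> Script_data
  | "plaintext" -> Plaintext
  | _ -> Data

let () =
  List.iter
    (fun name ->
      let label =
        match state_after_start_tag name with
        | Data -> "data"
        | Rcdata -> "RCDATA"
        | Rawtext -> "RAWTEXT"
        | Script_data -> "script data"
        | Plaintext -> "PLAINTEXT"
      in
      Printf.printf "<%s> -> %s\n" name label)
    [ "div"; "title"; "style"; "script"; "plaintext" ]
```

In a real tree builder the result would be translated to the corresponding State value and passed to set_state before tokenization continues.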

val set_last_start_tag : 'sink t -> string -> unit

Set the last start tag name.

The tree builder uses this to give the tokenizer the context it needs for end tag matching.
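In the WHATWG algorithm, the RCDATA, RAWTEXT, and script data states only treat an end tag as a real end tag when it is an "appropriate end tag token", meaning its name matches the last start tag emitted. The function below is a minimal sketch of that check. Its name and signature are hypothetical and are not part of this module.

```ocaml
(* [last_start_tag] is [None] when no start tag has been seen yet, in which
   case no end tag is appropriate. Tag names are assumed to be lowercase. *)
let is_appropriate_end_tag ~(last_start_tag : string option)
    (end_tag_name : string) : bool =
  match last_start_tag with
  | None -> false
  | Some name -> String.equal name end_tag_name

let () =
  (* Inside <textarea>, "</div>" is plain text but "</textarea>" closes it. *)
  let last_start_tag = Some "textarea" in
  Printf.printf "</div>: %b\n" (is_appropriate_end_tag ~last_start_tag "div");
  Printf.printf "</textarea>: %b\n"
    (is_appropriate_end_tag ~last_start_tag "textarea")
```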