Html5rw.Tokenizer
HTML5 tokenizer.
The tokenizer is the first stage of HTML5 parsing. It converts a stream of characters into a stream of tokens: start tags, end tags, text, comments, and DOCTYPEs.
Most users don't need to use the tokenizer directly; the parse function handles everything. The tokenizer is exposed for advanced use cases such as syntax highlighting or partial parsing.
HTML5 Tokenizer
This module implements the WHATWG HTML5 tokenization algorithm. The tokenizer converts an input byte stream into a sequence of tokens (start tags, end tags, text, comments, doctypes) that can be consumed by a tree builder.
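To make the character-to-token step concrete, here is a minimal, self-contained sketch of the idea. It is not the Html5rw implementation and it is far simpler than the WHATWG algorithm: the names toy_token and toy_tokenize are hypothetical, and attributes, comments, DOCTYPEs, and character references are all ignored.

```ocaml
(* Toy illustration only: NOT the Html5rw tokenizer and not the WHATWG
   state machine. All names here are hypothetical. *)
type toy_token =
  | Start_tag of string
  | End_tag of string
  | Text of string

let toy_tokenize (input : string) : toy_token list =
  let n = String.length input in
  let rec go i acc =
    if i >= n then List.rev acc
    else if input.[i] = '<' then begin
      match String.index_from_opt input i '>' with
      | None ->
        (* Unterminated tag: treat the remainder as text. *)
        List.rev (Text (String.sub input i (n - i)) :: acc)
      | Some j ->
        (* Everything between '<' and '>' is the tag body. *)
        let body = String.sub input (i + 1) (j - i - 1) in
        let tok =
          if String.length body > 0 && body.[0] = '/' then
            End_tag (String.sub body 1 (String.length body - 1))
          else Start_tag body
        in
        go (j + 1) (tok :: acc)
    end
    else begin
      (* Consume text up to the next '<' or the end of input. *)
      let j =
        match String.index_from_opt input i '<' with
        | Some j -> j
        | None -> n
      in
      go j (Text (String.sub input i (j - i)) :: acc)
    end
  in
  go 0 []
```

With this sketch, toy_tokenize "<p>Hi</p>" yields a start tag, a text token, and an end tag, which is the kind of sequence a tree builder would consume.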
module Token : sig ... end
Token types produced by the tokenizer.
module State : sig ... end
Tokenizer states.
module Errors : sig ... end
Parse error types.
module Stream : sig ... end
Input stream with position tracking.
module type SINK = sig ... end
Interface for token consumers.
val create :
(module SINK with type t = 'sink) ->
'sink ->
?collect_errors:bool ->
?xml_mode:bool ->
unit ->
'sink t
Create a new tokenizer.
Run the tokenizer on the given input.
The tokenizer will read from the reader and call the sink's process function for each token until EOF is reached.
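The first two arguments of create pair a first-class module with a value of that module's type t, and the tokenizer then calls the sink's process function for each token. The sketch below shows that pattern in isolation. It assumes a stand-in signature TOY_SINK whose process takes a plain string; the real SINK signature and token type are defined by Html5rw and may differ, and feed and Collector are hypothetical names.

```ocaml
(* Hypothetical stand-in for a sink interface. The real SINK signature
   belongs to Html5rw and may look different. *)
module type TOY_SINK = sig
  type t
  val process : t -> string -> unit
end

(* A driver that receives the sink's module and the sink's state as two
   separate arguments, mirroring the shape of create's first two
   parameters. It calls process once per token. *)
let feed (type s) (module S : TOY_SINK with type t = s) (sink : s)
    (tokens : string list) : unit =
  List.iter (S.process sink) tokens

(* A sink that records every token it receives, most recent first. *)
module Collector = struct
  type t = string list ref
  let process acc tok = acc := tok :: !acc
end

let collected : string list =
  let acc = ref [] in
  feed (module Collector) acc [ "<p>"; "Hi"; "</p>" ];
  List.rev !acc
```

Keeping the module and its state separate lets one driver work with any consumer, for example a tree builder or a syntax highlighter, without the driver knowing the consumer's concrete type.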
val get_errors : 'sink t -> Html5rw__.Tokenizer_errors.t list
Get the list of parse errors encountered during tokenization.
The list is only populated if ~collect_errors:true was passed to create.
val set_state : 'sink t -> Html5rw__.Tokenizer_state.t -> unit
Set the tokenizer state.
Used by the tree builder to switch states for raw text elements.
val set_last_start_tag : 'sink t -> string -> unit
Set the last start tag name.
Used by the tree builder to track the context for end tag matching.
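The reason the last start tag matters is the WHATWG notion of an "appropriate end tag": inside raw text content such as a script element, an end tag only closes the element if its name matches the last start tag emitted, and no end tag is appropriate if no start tag has been emitted. A minimal sketch of that check follows; is_appropriate_end_tag is a hypothetical helper for illustration and not part of the Html5rw API.

```ocaml
(* Hypothetical helper, not part of Html5rw. It encodes the WHATWG
   "appropriate end tag" rule: the end tag name must equal the name of
   the last start tag emitted; with no start tag yet, nothing matches. *)
let is_appropriate_end_tag ~(last_start_tag : string option)
    (name : string) : bool =
  match last_start_tag with
  | Some last -> String.equal last name
  | None -> false
```

For example, with a last start tag of "script", the name "script" is appropriate while "b" is not, so a sequence like </b> inside a script body would be treated as text rather than as an end tag.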