Html5rw.Encoding
Encoding detection and decoding.
HTML documents can use various character encodings (UTF-8, ISO-8859-1, etc.). This module implements the WHATWG encoding sniffing algorithm that browsers use to detect the encoding of a document:
1. Check for a BOM (Byte Order Mark)
2. Look for a <meta charset> declaration
3. Use HTTP Content-Type header hint (if available)
4. Fall back to UTF-8
HTML5 Encoding Detection and Decoding
This module implements the WHATWG encoding sniffing and decoding algorithms for HTML5 documents. It handles automatic character encoding detection from byte order marks (BOM), meta charset declarations, and transport layer hints.
The encoding detection follows the WHATWG specification:
1. Check for a BOM (UTF-8, UTF-16LE, UTF-16BE)
2. Prescan for <meta charset> or <meta http-equiv="content-type">
3. Use transport layer encoding hint if provided
4. Fall back to UTF-8 as the default
Character encodings supported by the parser.
The HTML5 specification requires support for a large number of encodings, but this implementation focuses on the most common ones. Other encodings are mapped to their closest equivalent.
val encoding_to_string : encoding -> string
Convert an encoding to its canonical label string.
Returns the WHATWG canonical name, e.g., "utf-8", "utf-16le".
val sniff_bom : bytes -> (encoding * int) option
Detect encoding from a byte order mark.
Examines the first bytes of the input for a BOM and returns the detected encoding with the number of bytes to skip.
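The BOM check can be illustrated with a minimal, self-contained sketch. It is not this module's implementation: the type bom_encoding and the function sniff_bom_sketch are hypothetical names introduced only for this example, and the library's own encoding type may have different constructors.

```ocaml
(* Hypothetical local type for illustration; the library's [encoding]
   type may differ. *)
type bom_encoding = Utf8 | Utf16le | Utf16be

(* Return the encoding implied by a leading BOM together with the BOM
   length in bytes, or None when the input does not start with a BOM. *)
let sniff_bom_sketch (b : bytes) : (bom_encoding * int) option =
  let len = Bytes.length b in
  let byte i = Char.code (Bytes.get b i) in
  if len >= 3 && byte 0 = 0xEF && byte 1 = 0xBB && byte 2 = 0xBF then
    Some (Utf8, 3)        (* EF BB BF *)
  else if len >= 2 && byte 0 = 0xFF && byte 1 = 0xFE then
    Some (Utf16le, 2)     (* FF FE *)
  else if len >= 2 && byte 0 = 0xFE && byte 1 = 0xFF then
    Some (Utf16be, 2)     (* FE FF *)
  else
    None
```

A caller would typically drop the reported number of bytes (for example with Bytes.sub) before decoding the rest of the input.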
val normalize_label : string -> encoding option
Normalize an encoding label to its canonical form.
Maps encoding labels (case-insensitive, with optional whitespace) to the supported encoding types.
normalize_label "UTF-8" (* Some Utf8 *)
normalize_label "utf8" (* Some Utf8 *)
normalize_label "latin1" (* Some Windows_1252 *)
val prescan_for_meta_charset : bytes -> encoding option
Prescan bytes to find a meta charset declaration.
Implements the WHATWG prescan algorithm that looks for encoding declarations in the first 1024 bytes of an HTML document.
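To give a feel for the 1024-byte window, here is a deliberately simplified sketch. It is not the WHATWG prescan and not this module's code: the real algorithm tokenizes tags and attributes, skips comments, and pairs http-equiv with content, whereas this hypothetical find_charset_label_sketch only looks for a charset=label pattern and returns the raw lowercased label.

```ocaml
(* Simplified illustration only; see the caveats in the text above. *)
let find_charset_label_sketch (b : bytes) : string option =
  (* Only the first 1024 bytes are inspected. *)
  let limit = min 1024 (Bytes.length b) in
  let s = String.lowercase_ascii (Bytes.sub_string b 0 limit) in
  let n = String.length s in
  let key = "charset" in
  let klen = String.length key in
  let is_space c =
    c = ' ' || c = '\t' || c = '\n' || c = '\r' || c = '\012' in
  let rec skip_spaces i =
    if i < n && is_space s.[i] then skip_spaces (i + 1) else i in
  let is_label_char c =
    (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9')
    || c = '-' || c = '_' || c = ':' || c = '.' in
  let rec label_end j =
    if j < n && is_label_char s.[j] then label_end (j + 1) else j in
  let rec search i =
    if i + klen > n then None
    else if String.sub s i klen = key then begin
      let j = skip_spaces (i + klen) in
      if j < n && s.[j] = '=' then begin
        let j = skip_spaces (j + 1) in
        (* Accept an optional opening quote before the label. *)
        let j = if j < n && (s.[j] = '"' || s.[j] = '\'') then j + 1 else j in
        let stop = label_end j in
        if stop > j then Some (String.sub s j (stop - j)) else search (i + 1)
      end else search (i + 1)
    end else search (i + 1)
  in
  search 0
```

In a fuller pipeline the returned label would then be passed through something like normalize_label to obtain a supported encoding.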
val decode : bytes -> ?transport_encoding:string -> unit -> string * encoding
Decode raw bytes to a UTF-8 string with automatic encoding detection.
This function implements the full encoding sniffing algorithm:
1. Check for BOM
2. Prescan for meta charset
3. Use transport encoding hint if provided
4. Fall back to UTF-8
let (html, enc) = decode raw_bytes ()
(* html is now a UTF-8 string, enc is the detected encoding *)
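The precedence among the four sources listed for decode can be sketched as a simple first-match selection. The helper choose_encoding_sketch and its labelled arguments are hypothetical and exist only for this illustration; the sketch assumes each detection step has already produced an optional result.

```ocaml
(* Hypothetical helper: pick the first available source in the order
   BOM, meta charset, transport hint, then the default. *)
let choose_encoding_sketch ~bom ~meta ~transport ~default =
  match bom, meta, transport with
  | Some e, _, _ -> e            (* a BOM wins over everything else *)
  | None, Some e, _ -> e         (* then a meta charset declaration *)
  | None, None, Some e -> e      (* then the transport layer hint *)
  | None, None, None -> default  (* otherwise the fallback, e.g. UTF-8 *)
```

The function is polymorphic in the encoding representation, so it works equally with label strings or with a variant type.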