Module Html5rw.Encoding

Encoding detection and decoding.

HTML documents can use various character encodings (UTF-8, ISO-8859-1, etc.). This module implements the WHATWG encoding sniffing algorithm that browsers use to detect the encoding of a document:

1. Check for a BOM (Byte Order Mark)
2. Look for a <meta charset> declaration
3. Use HTTP Content-Type header hint (if available)
4. Fall back to UTF-8

HTML5 Encoding Detection and Decoding

This module implements the WHATWG encoding sniffing and decoding algorithms for HTML5 documents. It handles automatic character encoding detection from byte order marks (BOM), meta charset declarations, and transport layer hints.

Encoding Detection Algorithm

The encoding detection follows the WHATWG specification:

1. Check for a BOM (UTF-8, UTF-16LE, UTF-16BE)
2. Prescan for <meta charset> or <meta http-equiv="content-type">
3. Use transport layer encoding hint if provided
4. Fall back to UTF-8 as the default
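
The priority order listed above can be sketched as a single pattern match. This is only an illustration of the precedence described on this page, not the module's actual implementation: choose_encoding and its labelled arguments are hypothetical names, and the encoding type is redeclared locally so the snippet stands alone.

```ocaml
(* Local copy of the type so the sketch compiles on its own. *)
type encoding = Utf8 | Utf16le | Utf16be | Windows_1252 | Iso_8859_2 | Euc_jp

(* Hypothetical helper: pick the first available source, in the order
   BOM, meta prescan, transport hint, then the UTF-8 default. *)
let choose_encoding ~(bom : encoding option) ~(meta : encoding option)
    ~(transport : encoding option) : encoding =
  match bom, meta, transport with
  | Some e, _, _ -> e            (* 1. a BOM always wins *)
  | None, Some e, _ -> e         (* 2. then the meta declaration *)
  | None, None, Some e -> e      (* 3. then the transport layer hint *)
  | None, None, None -> Utf8     (* 4. default *)
```

With this ordering, a meta declaration overrides a transport hint whenever both are present and no BOM was found.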

Types

type encoding =
  | Utf8          (* UTF-8 encoding (default) *)
  | Utf16le       (* UTF-16 little-endian *)
  | Utf16be       (* UTF-16 big-endian *)
  | Windows_1252  (* Windows-1252 (Latin-1 superset) *)
  | Iso_8859_2    (* ISO-8859-2 (Central European) *)
  | Euc_jp        (* EUC-JP (Japanese) *)

Character encodings supported by the parser.

The HTML5 specification requires support for a large number of encodings, but this implementation focuses on the most common ones. Other encodings are mapped to their closest equivalent.

Encoding Utilities

val encoding_to_string : encoding -> string

Convert an encoding to its canonical label string.

Returns the WHATWG canonical name, e.g., "utf-8", "utf-16le".

val sniff_bom : bytes -> (encoding * int) option

Detect encoding from a byte order mark.

Examines the first bytes of the input for a BOM and returns the detected encoding with the number of bytes to skip.

  • returns

    Some (encoding, skip_bytes) if a BOM is found, None otherwise.
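
For reference, BOM detection of this kind can be sketched with the three standard byte sequences (EF BB BF for UTF-8, FF FE for UTF-16LE, FE FF for UTF-16BE). The function name sniff_bom_sketch is hypothetical and the type is redeclared locally; the module's own sniff_bom may differ in detail.

```ocaml
(* Local copy of the relevant constructors so the sketch stands alone. *)
type encoding = Utf8 | Utf16le | Utf16be

(* Hypothetical re-implementation: return the encoding implied by a
   leading BOM together with the BOM length in bytes. *)
let sniff_bom_sketch (b : bytes) : (encoding * int) option =
  let len = Bytes.length b in
  let byte i = Char.code (Bytes.get b i) in
  if len >= 3 && byte 0 = 0xEF && byte 1 = 0xBB && byte 2 = 0xBF then
    Some (Utf8, 3)
  else if len >= 2 && byte 0 = 0xFF && byte 1 = 0xFE then
    Some (Utf16le, 2)
  else if len >= 2 && byte 0 = 0xFE && byte 1 = 0xFF then
    Some (Utf16be, 2)
  else
    None
```

The second component of the result is the offset at which the actual document content starts.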

val normalize_label : string -> encoding option

Normalize an encoding label to its canonical form.

Maps encoding labels (case-insensitive, with optional whitespace) to the supported encoding types.

  • returns

    Some encoding if the label is recognized, None otherwise.

  normalize_label "UTF-8"      (* Some Utf8 *)
  normalize_label "utf8"       (* Some Utf8 *)
  normalize_label "latin1"     (* Some Windows_1252 *)
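
To illustrate what label normalization involves, here is a minimal sketch covering a small subset of the labels defined by the WHATWG Encoding Standard. The name normalize_label_sketch is hypothetical, the type is redeclared locally, and it is an assumption that "optional whitespace" means leading and trailing ASCII whitespace; the real function may recognize more labels.

```ocaml
(* Local copy of the type so the sketch compiles on its own. *)
type encoding = Utf8 | Utf16le | Utf16be | Windows_1252 | Iso_8859_2 | Euc_jp

(* Hypothetical re-implementation over a subset of WHATWG labels.
   String.trim strips space, tab, LF, FF and CR from both ends. *)
let normalize_label_sketch (label : string) : encoding option =
  match String.lowercase_ascii (String.trim label) with
  | "utf-8" | "utf8" | "unicode-1-1-utf-8" -> Some Utf8
  | "utf-16le" | "utf-16" -> Some Utf16le
  | "utf-16be" -> Some Utf16be
  | "windows-1252" | "cp1252" | "latin1" | "iso-8859-1"
  | "ascii" | "us-ascii" -> Some Windows_1252
  | "iso-8859-2" | "latin2" -> Some Iso_8859_2
  | "euc-jp" | "x-euc-jp" -> Some Euc_jp
  | _ -> None
```

Note that "latin1" and "iso-8859-1" map to Windows_1252, which matches the example above and the WHATWG label table.
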

val prescan_for_meta_charset : bytes -> encoding option

Prescan bytes to find a meta charset declaration.

Implements the WHATWG prescan algorithm that looks for encoding declarations in the first 1024 bytes of an HTML document.

  • returns

    Some encoding if a meta charset is found, None otherwise.

Decoding

val decode : bytes -> ?transport_encoding:string -> unit -> string * encoding

Decode raw bytes to a UTF-8 string with automatic encoding detection.

This function implements the full encoding sniffing algorithm:

1. Check for BOM
2. Prescan for meta charset
3. Use transport encoding hint if provided
4. Fall back to UTF-8

  • parameter transport_encoding

    Encoding hint from HTTP Content-Type header

  • returns

    (decoded_string, detected_encoding)

  let (html, enc) = decode raw_bytes ()
  (* html is now a UTF-8 string, enc is the detected encoding *)
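
As an illustration of what "decode to a UTF-8 string" means for a non-UTF-8 input, the following sketch transcodes UTF-16LE bytes using only the OCaml standard library (Bytes.get_uint16_le requires OCaml 4.08 or later). The function utf16le_to_utf8 is a hypothetical helper, not part of this module, and replacing malformed input with U+FFFD is an assumption about error handling.

```ocaml
(* Hypothetical helper: transcode UTF-16LE bytes to a UTF-8 string.
   Lone surrogates and an odd trailing byte become U+FFFD. *)
let utf16le_to_utf8 (b : bytes) : string =
  let len = Bytes.length b in
  let buf = Buffer.create len in
  let add_rep () = Buffer.add_utf_8_uchar buf Uchar.rep in
  let rec go i =
    if i + 1 >= len then begin
      (* Fewer than two bytes left: flag a dangling byte, then stop. *)
      if i < len then add_rep ()
    end else begin
      let u = Bytes.get_uint16_le b i in
      if u >= 0xD800 && u <= 0xDBFF && i + 3 < len then begin
        (* High surrogate with room for a second code unit. *)
        let lo = Bytes.get_uint16_le b (i + 2) in
        if lo >= 0xDC00 && lo <= 0xDFFF then begin
          let cp = 0x10000 + ((u - 0xD800) lsl 10) + (lo - 0xDC00) in
          Buffer.add_utf_8_uchar buf (Uchar.of_int cp);
          go (i + 4)
        end else begin
          add_rep ();
          go (i + 2)
        end
      end else if u >= 0xD800 && u <= 0xDFFF then begin
        (* Unpaired surrogate. *)
        add_rep ();
        go (i + 2)
      end else begin
        Buffer.add_utf_8_uchar buf (Uchar.of_int u);
        go (i + 2)
      end
    end
  in
  go 0;
  Buffer.contents buf
```

A caller would typically skip the BOM first, using the byte count reported by BOM detection, before transcoding the remainder.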