Module Html5rw.Entities

HTML entity decoding.

HTML uses character references to represent characters that are hard to type or have special meaning, for example &lt; for < and &amp; for &.

This module decodes all 2,231 named character references defined in the WHATWG specification, plus numeric references.

HTML5 Named Character Reference Decoding

This module provides functions for decoding HTML5 named character references (entities) and numeric character references. It includes the complete table of 2,231 named character references defined in the WHATWG HTML5 specification.

Character Reference Types

HTML5 supports three types of character references:

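Named References

A name written between & and ;, for example &amp; for &.

Decimal Numeric References

A decimal codepoint written after &#, for example &#38; for &.

Hexadecimal Numeric References

A hexadecimal codepoint written after &#x, for example &#x26; for &.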
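The three forms can be told apart from their prefix alone. The following is a minimal sketch of that classification, not this module's implementation: the reference type and the classify function are hypothetical names introduced for illustration, and the sketch assumes a complete reference that ends in a semicolon.

```ocaml
(* Hypothetical type and helpers, not part of Html5rw.Entities. *)
type reference =
  | Named of string      (* e.g. "amp" from "&amp;" *)
  | Decimal of int       (* e.g. 38 from "&#38;" *)
  | Hexadecimal of int   (* e.g. 0x26 from "&#x26;" *)

let is_digit c = c >= '0' && c <= '9'

let is_hex_digit c =
  is_digit c || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F')

(* True when [s] is non-empty and every character satisfies [p]. *)
let for_all_chars p s =
  let rec go i = i >= String.length s || (p s.[i] && go (i + 1)) in
  s <> "" && go 0

(* Classify a complete reference of the form "&...;".
   Returns None when the text is not shaped like a character reference.
   Very long digit runs are not range-checked in this sketch. *)
let classify (s : string) : reference option =
  let n = String.length s in
  if n < 3 || s.[0] <> '&' || s.[n - 1] <> ';' then None
  else
    let body = String.sub s 1 (n - 2) in
    let len = String.length body in
    if body.[0] <> '#' then Some (Named body)
    else if len >= 2 && (body.[1] = 'x' || body.[1] = 'X') then
      let digits = String.sub body 2 (len - 2) in
      if for_all_chars is_hex_digit digits then
        Option.map (fun cp -> Hexadecimal cp) (int_of_string_opt ("0x" ^ digits))
      else None
    else
      let digits = String.sub body 1 (len - 1) in
      if for_all_chars is_digit digits then
        Option.map (fun cp -> Decimal cp) (int_of_string_opt digits)
      else None
```

All three spellings of the ampersand classify to the same codepoint, 38 (0x26), once a named reference has been resolved through the entity table.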

Legacy Entity Handling

Some named entities are "legacy": older browsers accepted them without a trailing semicolon (e.g., &amp instead of &amp;). The parser handles these according to the WHATWG specification.
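One rule from the WHATWG specification is worth spelling out, since it is the likely reason decode takes an in_attribute flag: inside an attribute value, a legacy reference that lacks its semicolon is left undecoded when the next character is = or an ASCII alphanumeric. The sketch below illustrates only that rule. The function names are hypothetical and this is not the module's actual code.

```ocaml
(* Hypothetical helpers, not part of Html5rw.Entities. *)
let is_ascii_alnum (c : char) : bool =
  (c >= '0' && c <= '9') || (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')

(* Decide whether a legacy reference that was matched WITHOUT a trailing
   semicolon should be decoded. [next] is the character that follows the
   matched name, or None at end of input. *)
let decode_unterminated ~(in_attribute : bool) ~(next : char option) : bool =
  match next with
  | Some c when in_attribute && (c = '=' || is_ascii_alnum c) -> false
  | _ -> true
```

For example, in the attribute value "?a=1&copy=2" the text &copy is followed by =, so it stays literal. The same text in ordinary content would decode to the copyright sign.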

Decoding Functions

val decode : string -> in_attribute:bool -> string

Decode all character references in a text string.

Processes the string and replaces all valid character references (named and numeric) with their decoded UTF-8 equivalents.

  decode "Hello &amp; goodbye" ~in_attribute:false
  (* Returns: "Hello & goodbye" *)

  decode "&lt;script&gt;" ~in_attribute:false
  (* Returns: "<script>" *)
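To make the replacement step concrete, here is a deliberately small, self-contained sketch of a decoder. It is not this module's implementation: decode_basic and its four-entry table are hypothetical, it requires the trailing semicolon, and it ignores numeric references, legacy entities, and attribute context.

```ocaml
(* Hypothetical four-entry table; the real module covers all 2,231 names. *)
let table = [ ("amp", "&"); ("lt", "<"); ("gt", ">"); ("quot", "\"") ]

(* Replace each "&name;" whose name is in [table]; copy everything else. *)
let decode_basic (s : string) : string =
  let n = String.length s in
  let out = Buffer.create n in
  let rec go i =
    if i < n then
      if s.[i] <> '&' then begin
        Buffer.add_char out s.[i];
        go (i + 1)
      end else
        (* Look for the closing ';' and try the text in between as a name. *)
        let replacement =
          match String.index_from_opt s i ';' with
          | Some j ->
            (match List.assoc_opt (String.sub s (i + 1) (j - i - 1)) table with
             | Some r -> Some (r, j + 1)
             | None -> None)
          | None -> None
        in
        match replacement with
        | Some (r, next) -> Buffer.add_string out r; go next
        | None -> Buffer.add_char out '&'; go (i + 1)
  in
  go 0;
  Buffer.contents out
```

Unknown names and bare ampersands pass through unchanged, so decode_basic "&bogus;" yields "&bogus;".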
val decode_numeric : string -> is_hex:bool -> string option

Decode a numeric character reference.

  • parameter codepoint

    The Unicode codepoint to decode

  • returns

    The UTF-8 string representation

Note: Some codepoints are replaced according to the HTML5 specification (e.g., control characters in the 0x80-0x9F range are mapped to Windows-1252 equivalents).
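The replacement rules mentioned in the note can be sketched as a single function over codepoints. This is an illustration of the WHATWG rules as I understand them, not this module's code, and sanitize_codepoint is a hypothetical name. Null, surrogates, and out-of-range values become U+FFFD, and the 0x80-0x9F values that have a Windows-1252 meaning are remapped.

```ocaml
(* Hypothetical helper, not part of Html5rw.Entities. *)
let sanitize_codepoint (cp : int) : int =
  if cp <= 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF) then
    0xFFFD (* U+FFFD REPLACEMENT CHARACTER *)
  else
    match cp with
    (* Windows-1252 remapping of C1 controls. 0x81, 0x8D, 0x8F, 0x90 and
       0x9D have no Windows-1252 meaning and are left unchanged. *)
    | 0x80 -> 0x20AC | 0x82 -> 0x201A | 0x83 -> 0x0192 | 0x84 -> 0x201E
    | 0x85 -> 0x2026 | 0x86 -> 0x2020 | 0x87 -> 0x2021 | 0x88 -> 0x02C6
    | 0x89 -> 0x2030 | 0x8A -> 0x0160 | 0x8B -> 0x2039 | 0x8C -> 0x0152
    | 0x8E -> 0x017D | 0x91 -> 0x2018 | 0x92 -> 0x2019 | 0x93 -> 0x201C
    | 0x94 -> 0x201D | 0x95 -> 0x2022 | 0x96 -> 0x2013 | 0x97 -> 0x2014
    | 0x98 -> 0x02DC | 0x99 -> 0x2122 | 0x9A -> 0x0161 | 0x9B -> 0x203A
    | 0x9C -> 0x0153 | 0x9E -> 0x017E | 0x9F -> 0x0178
    | _ -> cp
```

Under these rules &#x80; and &#128; both yield the euro sign (U+20AC) rather than a C1 control character.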

val lookup : Stdlib.String.t -> string option

Look up a named character reference.

  • parameter name

    The entity name without & and ; (e.g., "amp")

  • returns

    Some s, where s is the entity's replacement text, if the entity exists; None otherwise

  lookup "amp"    (* Some "&" *)
  lookup "nbsp"   (* Some "\xC2\xA0", i.e. U+00A0 as UTF-8 *)
  lookup "bogus"  (* None *)
val is_legacy : Stdlib.String.t -> bool

Check if an entity is a legacy entity.

Legacy entities are those that were historically recognized without a trailing semicolon. The parser handles these specially to maintain browser compatibility.

  is_legacy "amp"    (* true: &amp works without ; *)
  is_legacy "nbsp"   (* true *)
  is_legacy "alpha"  (* false: requires semicolon *)
val codepoint_to_utf8 : int -> string

Convert a Unicode codepoint to its UTF-8 encoding.

  • parameter codepoint

    The Unicode codepoint (0 to 0x10FFFF)

  • returns

    The UTF-8 encoded string
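For readers unfamiliar with the encoding, the sketch below shows how a codepoint maps to one to four UTF-8 bytes. It is a standalone illustration under the name utf8_of_codepoint, which is hypothetical. It is not this module's implementation, and unlike a strict encoder it does not reject surrogate codepoints.

```ocaml
(* Hypothetical standalone encoder, not part of Html5rw.Entities. *)
let utf8_of_codepoint (cp : int) : string =
  let b = Buffer.create 4 in
  let add n = Buffer.add_char b (Char.chr n) in
  if cp < 0 || cp > 0x10FFFF then invalid_arg "utf8_of_codepoint"
  else if cp < 0x80 then
    (* 1 byte: 0xxxxxxx *)
    add cp
  else if cp < 0x800 then begin
    (* 2 bytes: 110xxxxx 10xxxxxx *)
    add (0xC0 lor (cp lsr 6));
    add (0x80 lor (cp land 0x3F))
  end else if cp < 0x10000 then begin
    (* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx *)
    add (0xE0 lor (cp lsr 12));
    add (0x80 lor ((cp lsr 6) land 0x3F));
    add (0x80 lor (cp land 0x3F))
  end else begin
    (* 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx *)
    add (0xF0 lor (cp lsr 18));
    add (0x80 lor ((cp lsr 12) land 0x3F));
    add (0x80 lor ((cp lsr 6) land 0x3F));
    add (0x80 lor (cp land 0x3F))
  end;
  Buffer.contents b
```

With this sketch, 0x26 encodes to the single byte "&", 0xA0 to the two bytes "\xC2\xA0", and 0x20AC to the three bytes "\xE2\x82\xAC".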

Sub-modules

module Numeric_ref : sig ... end

Numeric character reference handling.